Convert Text to Excel - python

I have the following text file as input
Patient Name: XXX,A
Date of Service: 12/12/2018
Speaker ID: 10531
Visit Start: 06/07/2018
Visit End: 06/18/2018
Recipient:
REQUESTING PHYSICIAN:
Mr.XXX
REASON FOR CONSULTATION:
Acute asthma.
HISTORY OF PRESENT ILLNESS:
The patient is a 64-year-old female who is well known to our practice. She has not been feeling well over the last 3 weeks and has been complaining of increasing shortness of breath, cough, wheezing, and chest tightness. She was prescribed systemic steroids and Zithromax. Her respiratory symptoms persisted; and subsequently, she went to Capital Health Emergency Room. She presented to the office again yesterday with increasing shortness of breath, chest tightness, wheezing, and cough productive of thick sputum. She also noted some low-grade temperature.
PAST MEDICAL HISTORY:
Remarkable for bronchial asthma, peptic ulcer disease, hyperlipidemia, coronary artery disease with anomalous coronary artery, status post tonsillectomy, appendectomy, sinus surgery, and status post rotator cuff surgery.
HOME MEDICATIONS:
Include;
1. Armodafinil.
2. Atorvastatin.
3. Bisoprolol.
4. Symbicort.
5. Prolia.
6. Nexium.
7. Gabapentin.
8. Synthroid.
9. Linzess_____.
10. Montelukast.
11. Domperidone.
12. Tramadol.
ALLERGIES:
1. CEPHALOSPORIN.
2. PENICILLIN.
3. SULFA.
SOCIAL HISTORY:
She is a lifelong nonsmoker.
PHYSICAL EXAMINATION:
GENERAL: Shows a pleasant 64-year-old female.
VITAL SIGNS: Blood pressure 108/56, pulse of 70, respiratory rate is 26, and pulse oximetry is 94% on room air. She is afebrile.
HEENT: Conjunctivae are pink. Oral cavity is clear.
CHEST: Shows increased AP diameter and decreased breath sounds with diffuse inspiratory and expiratory wheeze and prolonged expiratory phase.
CARDIOVASCULAR: Regular rate and rhythm.
ABDOMEN: Soft.
EXTREMITIES: Does not show any edema.
LABORATORY DATA:
Her INR is 1.1. Chemistry; sodium 139, potassium 3.3, chloride 106, CO2 of 25, BUN is 10, creatinine 0.74, and glucose is 110. BNP is 40. White count on admission 16,800; hemoglobin 12.5; and neutrophils 88%. Two sets of blood cultures are negative. CT scan of the chest is obtained, which is consistent with tree-in-bud opacities of the lung involving bilateral lower lobes with patchy infiltrate involving the right upper lobe. There is mild bilateral bronchial wall thickening.
IMPRESSION:
1. Acute asthma.
2. Community acquired pneumonia.
3. Probable allergic bronchopulmonary aspergillosis.
I want the text file to be converted into an Excel file, like this:
Patient Name Date of Service Speaker ID Visit Start Visit End Recipient ..... IMPRESSION:
XYZ 2/27/2018 10101 06-07-2018 06/18/2018 NA ....... 1. Acute asthma. 2. Community acquired pneumonia. 3. Probable allergic
I wrote the following code
from collections import OrderedDict
from csv import DictWriter

with open('1.txt') as infile:
    registrations = []
    fields = OrderedDict()
    d = {}
    for line in infile:
        line = line.strip()
        if line:
            key, value = [s.strip() for s in line.split(':', 1)]
            d[key] = value
            fields[key] = None
        else:
            if d:
                registrations.append(d)
                d = {}
    else:
        if d:  # handle EOF
            registrations.append(d)

with open('registrations.csv', 'w') as outfile:
    writer = DictWriter(outfile, fieldnames=fields)
    writer.writeheader()
    writer.writerows(registrations)
I'm getting an error
ValueError: not enough values to unpack (expected 2, got 1)
I'm not sure what the error is saying. I searched through websites but could not find a solution. I tried editing the file to remove the extra spaces, and with that the above code worked. But in a real-world scenario there will be hundreds of thousands of files, so manually editing every file to remove all the spaces is not possible.

Your particular error is likely from
key, value = [s.strip() for s in line.split(':', 1)]
Some of your lines don't have a colon, so there is only one value in your list, and we can't assign one value to the pair key, value.
For example:
line = 'this is some text with a : colon'
key, value = [s.strip() for s in line.split(':', 1)]
print(key)
print(value)
returns:
this is some text with a
colon
But you'll get your error with
line = 'this is some text without a colon'
key, value = [s.strip() for s in line.split(':', 1)]
print(key)
print(value)
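One way around it, as a minimal sketch (this is my assumption about the desired behaviour, not the asker's final code): only split lines that actually contain a colon, and fold any line without one into the value of the most recent key, so the numbered medication lines end up under HOME MEDICATIONS.
from collections import OrderedDict
from csv import DictWriter

registrations = []
fields = OrderedDict()
d = {}
last_key = None

with open('1.txt') as infile:
    for line in infile:
        line = line.strip()
        if ':' in line:
            # Normal "Key: value" line.
            key, value = [s.strip() for s in line.split(':', 1)]
            d[key] = value
            fields[key] = None
            last_key = key
        elif line and last_key is not None:
            # No colon: treat the line as a continuation of the previous
            # field, e.g. the numbered items under "HOME MEDICATIONS:".
            d[last_key] = (d[last_key] + ' ' + line).strip()
        elif not line and d:
            # Blank line: end of one record.
            registrations.append(d)
            d = {}
            last_key = None
    if d:  # handle EOF without a trailing blank line
        registrations.append(d)

with open('registrations.csv', 'w', newline='') as outfile:
    writer = DictWriter(outfile, fieldnames=fields)
    writer.writeheader()
    writer.writerows(registrations)
The resulting CSV can then be opened or converted in Excel; the key point is simply that lines without a colon no longer reach the two-element unpacking.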

Related

io.UnsupportedOperation: not readable in a+ mode

I have run into a strange problem. I have a .txt file, 55.1 MB (55,082,716 bytes), that is supposedly plain text but actually contains HTML-like markup such as this:
<div id="lesson-archive" class="container"><div id="primary" class="content-area match-height"><div class="lesson-excerpt content-container"><article><p>We hear about climate change pretty much every day now. We see pictures of floods, fires and heatwaves on TV news. Scientists have just announced that July was the hottest month ever recorded. The scientists are from the National Oceanic and Atmospheric Administration (NOAA) in the USA. A spokesperson from NOAA said: "July is typically the world's warmest month of the year, but July 2021 outdid itself as the hottest July and hottest month ever." NOAA said Earth's land and ocean surface temperature in July was 0.93 degree Celsius higher than the 20th-century average of 15.8 degrees Celsius. The Northern Hemisphere was 1.54 degrees Celsius hotter than average.<br><br>
I would like to remove certain characters using this regex: [^a-zA-Z.,!?-—() ]
Here is my code to solve this problem:
import re

with open('data.txt', 'a+') as f:
    data = f.read()
    edited_data = re.sub(r'[^a-zA-Z.,!?-—() ]', '', data)
    f.write(edited_data)
And that causes an error:
io.UnsupportedOperation: not readable
There are some questions about a similar problem, but not in a+ mode. Why did I get this error?
I'm using Python 3.8 on Ubuntu 20.04.
You cannot read the file with read() in that mode; reading is only available in the default mode or 'r'. You might want to do something like this:
import re

with open('data.txt') as f:
    data = f.read()

with open('data.txt', 'w') as f:
    edited_data = re.sub(r'[^a-zA-Z.,!?-—() ]', "", data)
    f.write(edited_data)
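If the goal is an in-place edit with a single open() call, a minimal sketch along these lines should also work. Note that the hyphen is escaped here on the assumption that the unescaped ?-— in the original pattern forms an unintended character range:
import re

# Read-modify-write in place with one file handle ('r+' allows both).
with open('data.txt', 'r+') as f:
    data = f.read()
    edited_data = re.sub(r'[^a-zA-Z.,!?\-—() ]', '', data)
    f.seek(0)       # jump back to the start of the file
    f.truncate()    # drop the old contents before rewriting
    f.write(edited_data)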

Obtain tsv from text with a specific pattern

I'm a biologist and I need to extract information from a text file.
I have a file with plain text like this:
12018411
Comparison of two timed artificial insemination (TAI) protocols for management of first insemination postpartum.
TAI|timed artificial insemination|0.999808
Two estrus-synchronization programs were compared and factors influencing their success over a year were evaluated. All cows received a setup injection of PGF2alpha at 39 +/- 3 d postpartum. Fourteen days later they received GnRH, followed in 7 d by a second injection of PGF2alpha. Cows (n = 523) assigned to treatment 1 (modified targeted breeding) were inseminated based on visual signs of estrus at 24, 48, or 72 h after the second PGF2alpha injection. Any cow not observed in estrus was inseminated at 72 h. Cows (n = 440) assigned to treatment 2 received a second GnRH injection 48 h after the second PGF2alpha, and all were inseminated 24 h later. Treatment, season of calving, multiple birth, estrual status at insemination, number of occurrences of estrus before second PGF2alpha, prophylactic use of PGF2alpha, retained fetal membranes, and occurrence of estrus following the setup PGF2alpha influenced success. Conception rate was 31.2% (treatment 1) and 29.1% (treatment 2). A significant interaction occurred between protocol and estrual status at insemination. Cows in estrus at insemination had a 45.8% (treatment 1) or 35.4% (treatment 2) conception rate. The conception rate for cows not expressing estrus at insemination was 19.2% (treatment 1) and 27.7% (treatment 2). Provided good estrous detection exists, modified targeted breeding can be as successful as other timed artificial insemination programs. Nutritional, environmental, and management strategies to reduce postpartum disorders and to minimize the duration of postpartum anestrus are critical if synchronization schemes are used to program first insemination after the voluntary waiting period.
8406022
Deletion of the beta-turn/alpha-helix motif at the exon 2/3 boundary of human c-Myc leads to the loss of its immortalizing function.
The protein product (c-Myc) of the human c-myc proto-oncogene carries a beta-turn/alpha-helix motif at the exon2/exon3 boundary. The amino acid (aa) sequence and secondary structure of this motif are highly conserved among several nuclearly localized oncogene products, c-Myc, N-Myc, c-Fos, SV40 large T and adenovirus (Ad) Ela. Removal of this region from Ad E1a results in the loss of the transforming properties of the virus without destroying its known transregulatory functions. In order to analyse whether deletion of the above-mentioned region from c-Myc has a similar effect on its transformation activity, we constructed a deletion mutant (c-myc delta) lacking the respective aa at the exon2/exon3 boundary. In contrast to the c-myc wild-type gene product, constitutive expression of c-myc delta does not lead to the immortalization of primary mouse embryo fibroblast cells (MEF cells). This result indicates that c-Myc and Ad El a share a common domain which is involved in the transformation process by both oncogenes.
aa|amino acid|0.99818
Ad|adenovirus|0.96935
MEF cells|mouse embryo fibroblast cells|0.994648
The first line is the ID, the second line is the title, the third line is usually the abstract (sometimes the abbreviations come first), and the last lines (if there are any) are abbreviations: the line starts with a double space, then comes the abbreviation, its meaning, and a number, separated by '|'. For example:
GA|general anesthesia|0.99818
Then there is a blank line and it starts again: ID, Title, Abstract, Abbreviations, or ID, Title, Abbreviations, Abstract.
I need to take this data and convert it to a TSV file like this:
12018411 TAI timed artificial insemination
8406022 aa amino acids
8406022 Ad adenovirus
... ... ...
First column ID, second column Abbreviation, and third column Meaning of this abbreviation.
I tried converting it into a DataFrame first and then writing that out as TSV, but I don't know how to extract the information from the text with the structure I need.
I also tried this code:
from collections import namedtuple
import pandas as pd

Item = namedtuple('Item', 'ID')
items = []

with open("identify_abbr-out.txt", "r", encoding='UTF-8') as f:
    lines = f.readlines()
    for line in lines:
        if line == '\n':
            ID = ¿nextline?
        if line.startswith("  "):
            Abbreviation = line
            items.append(Item(ID, Abbreviation))

df = pd.DataFrame.from_records(items, columns=['ID', 'Abbreviation'])
But I don't know how to read the next line, and the code does not work because sometimes there are blank lines in the middle, between the abstract and the title.
I'm using Python 3.8.
Thank you very much in advance.
Assuming test.txt has your input data, I used simple file-read functions to process it:
file1 = open('test.txt', 'r')
Lines = file1.readlines()

outputlines = []
outputline = ""
counter = 0
for l in Lines:
    if l.strip() == "":
        outputline = ""
        counter = 0
    elif counter == 0:
        outputline = outputline + l.strip() + "|"
        counter = counter + 1
    elif counter == 1:
        counter = counter + 1
    else:
        if len(l.split("|")) == 3 and l[0:2] == "  ":
            outputlines.append(outputline + l.strip() + "\n")
        counter = counter + 1

file1 = open('myfile.txt', 'w')
file1.writelines(outputlines)
file1.close()
Here the file is read line by line. A counter is kept and reset whenever there is a blank line, and the ID is read from the very next line. If a row has three '|'-separated fields and starts with two spaces, it is exported together with the ID.
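Since the requested output is a TSV, here is a slightly different sketch of my own (assumed input layout: the ID is the first non-blank line of each block, and abbreviation lines start with two spaces and contain exactly two '|' separators) that writes tab-separated columns directly:
rows = []
current_id = None

with open('test.txt', 'r', encoding='utf-8') as f:
    for line in f:
        if line.strip() == "":
            current_id = None             # blank line ends the current record
        elif current_id is None:
            current_id = line.strip()     # first non-blank line is the ID
        elif line.startswith("  ") and line.count("|") == 2:
            abbr, meaning, _score = line.strip().split("|")
            rows.append(f"{current_id}\t{abbr}\t{meaning}\n")

with open('abbreviations.tsv', 'w', encoding='utf-8') as out:
    out.writelines(rows)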

Exit code 137 (interrupted by signal 9: SIGKILL) when writing large dictionary into CSV

Using the code below, I read a large number of XML files (around 300,000) into a nested dictionary. I want to write this into a single CSV file. On my first attempt I did this by using a pandas DataFrame as an intermediary. The dictionary is fully constructed, but during the last step, when converting it to CSV, I get exit code 137 (interrupted by signal 9: SIGKILL).
(I found that building a nested dictionary instead of appending to a DataFrame is by far the quickest option.)
Any idea how I can manage to write into a single CSV by circumventing this error? Is there a way to free up some memory somewhere in between?
Thanks!
# Import packages.
import pandas as pd
from lxml import etree
import os
from os import listdir
from os.path import isfile, join
from tqdm import tqdm
from datetime import datetime
from collections import defaultdict

# Set options for displaying results.
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

def run(file, content):
    data = etree.parse(file)
    # Get all paths from the XML.
    get_path = lambda x: data.getpath(x)
    paths = list(map(get_path, data.getroot().getiterator()))
    content = ""
    content = [
        data.getroot().xpath(path)
        for path in paths
    ]
    get_text = lambda x: x.text
    content = [list(map(get_text, i)) for i in content]
    content = dict(zip(paths, content))
    content = {
        content["/clinical_study/id_info/nct_id"][0]: content
    }
    dict_final.update(content)

def write_csv(df_name, csv):
    df_name.to_csv(csv, sep=";")

####### RUN ######
mypath = '/Users/Documents/AllPublicXML'
folder_all = os.listdir(mypath)
dict_final = {}
df_final = pd.DataFrame()

for folder in tqdm(folder_all):
    mypath2 = mypath + "/" + folder
    print(folder)
    if os.path.isdir(mypath2):
        file = [f for f in listdir(mypath2) if isfile(join(mypath2, f))]
        output = "./Output/" + folder + ".csv"
        for x in tqdm(file):
            dir = mypath2 + "/" + x
            #output = "./Output/"+x+".csv"
            dict_name = x.split(".", 1)[0]
            try:
                run(dir, dict_name)
            except:
                log = open("log.txt", "a+")
                log.write(str(datetime.now()) + ": Error in file " + x + "\r \n")
                pass
        log = open("log.txt", "a+")
        log.write(str(datetime.now()) + ": " + folder + " written succesfully \r \n")

df_final = pd.DataFrame.from_dict(dict_final, orient='index')
write_csv(df_final, "./Output/final_csv.csv")
log.close()
The XML files look like this:
<clinical_study>
<!--
This xml conforms to an XML Schema at:
https://clinicaltrials.gov/ct2/html/images/info/public.xsd
-->
<required_header>
<download_date>
ClinicalTrials.gov processed this data on March 20, 2020
</download_date>
<link_text>Link to the current ClinicalTrials.gov record.</link_text>
<url>https://clinicaltrials.gov/show/NCT03261284</url>
</required_header>
<id_info>
<org_study_id>2017-P-032</org_study_id>
<nct_id>NCT03261284</nct_id>
</id_info>
<brief_title>
D-dimer to Guide Anticoagulation Therapy in Patients With Atrial Fibrillation
</brief_title>
<acronym>DATA-AF</acronym>
<official_title>
D-dimer to Determine Intensity of Anticoagulation to Reduce Clinical Outcomes in Patients With Atrial Fibrillation
</official_title>
<sponsors>
<lead_sponsor>
<agency>Wuhan Asia Heart Hospital</agency>
<agency_class>Other</agency_class>
</lead_sponsor>
</sponsors>
<source>Wuhan Asia Heart Hospital</source>
<oversight_info>
<has_dmc>Yes</has_dmc>
<is_fda_regulated_drug>No</is_fda_regulated_drug>
<is_fda_regulated_device>No</is_fda_regulated_device>
</oversight_info>
<brief_summary>
<textblock>
This was a prospective, three arms, randomized controlled study.
</textblock>
</brief_summary>
<detailed_description>
<textblock>
D-dimer testing is performed in AF Patients receiving warfarin therapy (target INR:1.5-2.5) in Wuhan Asia Heart Hospital. Patients with elevated d-dimer levels (>0.5ug/ml FEU) were SCREENED AND RANDOMIZED to three groups at a ratio of 1:1:1. First, NOAC group,the anticoagulant was switched to Dabigatran (110mg,bid) when elevated d-dimer level was detected during warfarin therapy.Second,Higher-INR group, INR was adjusted to higher level (INR:2.0-3.0) when elevated d-dimer level was detected during warfarin therapy. Third, control group, patients with elevated d-dimer levels have no change in warfarin therapy. Warfarin is monitored once a month by INR ,and dabigatran dose not need monitor. All patients were followed up for 24 months until the occurrence of endpoints, including bleeding events, thrombotic events and all-cause deaths.
</textblock>
</detailed_description>
<overall_status>Enrolling by invitation</overall_status>
<start_date type="Anticipated">March 1, 2019</start_date>
<completion_date type="Anticipated">May 30, 2020</completion_date>
<primary_completion_date type="Anticipated">February 28, 2020</primary_completion_date>
<phase>N/A</phase>
<study_type>Interventional</study_type>
<has_expanded_access>No</has_expanded_access>
<study_design_info>
<allocation>Randomized</allocation>
<intervention_model>Parallel Assignment</intervention_model>
<primary_purpose>Treatment</primary_purpose>
<masking>None (Open Label)</masking>
</study_design_info>
<primary_outcome>
<measure>Thrombotic events</measure>
<time_frame>24 months</time_frame>
<description>
Stroke, DVT, PE, Peripheral arterial embolism, ACS etc.
</description>
</primary_outcome>
<primary_outcome>
<measure>hemorrhagic events</measure>
<time_frame>24 months</time_frame>
<description>cerebral hemorrhage,Gastrointestinal bleeding etc.</description>
</primary_outcome>
<secondary_outcome>
<measure>all-cause deaths</measure>
<time_frame>24 months</time_frame>
</secondary_outcome>
<number_of_arms>3</number_of_arms>
<enrollment type="Anticipated">600</enrollment>
<condition>Atrial Fibrillation</condition>
<condition>Thrombosis</condition>
<condition>Hemorrhage</condition>
<condition>Anticoagulant Adverse Reaction</condition>
<arm_group>
<arm_group_label>DOAC group</arm_group_label>
<arm_group_type>Experimental</arm_group_type>
<description>
Patients with elevated d-dimer levels was switched to DOAC (dabigatran 150mg, bid).
</description>
</arm_group>
<arm_group>
<arm_group_label>Higher-INR group</arm_group_label>
<arm_group_type>Experimental</arm_group_type>
<description>
Patients' target INR was adjusted from 1.5-2.5 to 2.0-3.0 by adding warfarin dose.
</description>
</arm_group>
<arm_group>
<arm_group_label>Control group</arm_group_label>
<arm_group_type>No Intervention</arm_group_type>
<description>
Patients continue previous strategy without change.
</description>
</arm_group>
<intervention>
<intervention_type>Drug</intervention_type>
<intervention_name>Dabigatran Etexilate 150 MG [Pradaxa]</intervention_name>
<description>Dabigatran Etexilate 150mg,bid</description>
<arm_group_label>DOAC group</arm_group_label>
<other_name>Pradaxa</other_name>
</intervention>
<intervention>
<intervention_type>Drug</intervention_type>
<intervention_name>Warfarin Pill</intervention_name>
<description>Add warfarin dose according to INR values.</description>
<arm_group_label>Higher-INR group</arm_group_label>
</intervention>
<eligibility>
<criteria>
<textblock>
Inclusion Criteria: - Patients with non-valvular atrial fibrillation - Receiving warfarin therapy Exclusion Criteria: - Patients who had suffered from recent (within 3 months) myocardial infarction, ischemic stroke, deep vein thrombosis, cerebral hemorrhages, or other serious diseases. - Those who had difficulty in compliance or were unavailable for follow-up.
</textblock>
</criteria>
<gender>All</gender>
<minimum_age>18 Years</minimum_age>
<maximum_age>75 Years</maximum_age>
<healthy_volunteers>No</healthy_volunteers>
</eligibility>
<overall_official>
<last_name>Zhenlu ZHANG, MD,PhD</last_name>
<role>Study Director</role>
<affiliation>Wuhan Asia Heart Hospital</affiliation>
</overall_official>
<location>
<facility>
<name>Zhang litao</name>
<address>
<city>Wuhan</city>
<state>Hubei</state>
<zip>430022</zip>
<country>China</country>
</address>
</facility>
</location>
<location_countries>
<country>China</country>
</location_countries>
<verification_date>March 2019</verification_date>
<study_first_submitted>August 22, 2017</study_first_submitted>
<study_first_submitted_qc>August 23, 2017</study_first_submitted_qc>
<study_first_posted type="Actual">August 24, 2017</study_first_posted>
<last_update_submitted>March 6, 2019</last_update_submitted>
<last_update_submitted_qc>March 6, 2019</last_update_submitted_qc>
<last_update_posted type="Actual">March 7, 2019</last_update_posted>
<responsible_party>
<responsible_party_type>Sponsor</responsible_party_type>
</responsible_party>
<keyword>D-dimer</keyword>
<keyword>Nonvalvular atrial fibrillation</keyword>
<keyword>Direct thrombin inhibitor</keyword>
<keyword>INR</keyword>
<condition_browse>
<!--
CAUTION: The following MeSH terms are assigned with an imperfect algorithm
-->
<mesh_term>Atrial Fibrillation</mesh_term>
<mesh_term>Thrombosis</mesh_term>
<mesh_term>Hemorrhage</mesh_term>
</condition_browse>
<intervention_browse>
<!--
CAUTION: The following MeSH terms are assigned with an imperfect algorithm
-->
<mesh_term>Warfarin</mesh_term>
<mesh_term>Dabigatran</mesh_term>
<mesh_term>Fibrin fragment D</mesh_term>
</intervention_browse>
<!--
Results have not yet been posted for this study
-->
</clinical_study>
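One common way to dodge the memory blow-up is to stream rows to disk with csv.DictWriter instead of accumulating dict_final and then a DataFrame. Below is a minimal sketch, not the original code; parse_xml() is a hypothetical stand-in for the per-file logic in run() above, returning one flat dict per XML file:
import csv

def stream_to_csv(xml_paths, out_path, parse_xml):
    """Write one CSV row per XML file without holding all rows in memory."""
    # First pass: collect the union of keys so the header covers every column.
    fieldnames, seen = [], set()
    for path in xml_paths:
        for key in parse_xml(path):
            if key not in seen:
                seen.add(key)
                fieldnames.append(key)

    # Second pass: parse again and write each record straight to disk.
    with open(out_path, 'w', newline='') as out:
        writer = csv.DictWriter(out, fieldnames=fieldnames, delimiter=';')
        writer.writeheader()
        for path in xml_paths:
            writer.writerow(parse_xml(path))
Parsing everything twice trades CPU time for a flat memory profile; if the full column set is known in advance, the first pass can be dropped.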

Sklearn CountVectorizer "Empty Vocabulary" error in Dataframe when computing nGram

I have a dataframe (data) with 3 records:
id text
0001 The farmer plants grain
0002 The fisher catches tuna
0003 The police officer fights crime
I group that dataframe by id:
data_grouped = data.groupby('id')
Describing the resulting groupby object shows that all the records remain.
I then run this code to find the nGrams in the text and join them to the id:
word_vectorizer = CountVectorizer(stop_words=None, ngram_range=(2, 2),
                                  analyzer='word')

for id, group in data_grouped:
    X = word_vectorizer.fit_transform(group['text'])
    frequencies = sum(X).toarray()[0]
    results = pd.DataFrame(frequencies, columns=['frequency'])
    dfinner = pd.DataFrame(word_vectorizer.get_feature_names())
    dfinner['id'] = id
    final = results.join(dfinner)
When I run all of this code together, an error is raised for the word_vectorizer stating "empty vocabulary; perhaps the documents only contain stop words". I know this error has been mentioned in many other questions, but I couldn't find one that deals with a DataFrame.
To further complicate the issue, the error doesn't always show up. I am pulling the data from a SQL DB, and depending on how many records I pull in, the error may or may not show up. For instance, pulling in Top 10 records causes the error, but Top 5 doesn't.
EDIT:
Complete Traceback
Traceback (most recent call last):
File "<ipython-input-63-d261e44b8cce>", line 1, in <module>
runfile('C:/Users/taca/Documents/Work/Python/Text Analytics/owccomments.py', wdir='C:/Users/taca/Documents/Work/Python/Text Analytics')
File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 866, in runfile
execfile(filename, namespace)
File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Users/taca/Documents/Work/Python/Text Analytics/owccomments.py", line 38, in <module>
X = word_vectorizer.fit_transform(group['cleanComments'])
File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 839, in fit_transform
self.fixed_vocabulary_)
File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 781, in _count_vocab
raise ValueError("empty vocabulary; perhaps the documents only"
ValueError: empty vocabulary; perhaps the documents only contain stop words
I see what is going on here, but in running through it I have a nagging question: why are you doing this? I'm not quite sure I understand the value of fitting the CountVectorizer to each document in a collection of documents. Generally the idea is to fit it to the entire corpus and then do your analysis from there. I get that maybe you want to be able to see which grams exist in each document, but there are other, much easier and more optimized, ways of doing this. For example:
df = pd.DataFrame({'id': [1,2,3], 'text': ['The farmer plants grain', 'The fisher catches tuna', 'The police officer fights crime']})
cv = CountVectorizer(stop_words=None, ngram_range=(2,2), analyzer='word')
dt_mat = cv.fit_transform(df.text)
print(cv.get_feature_names())
['catches tuna',
'farmer plants',
'fights crime',
'fisher catches',
'officer fights',
'plants grain',
'police officer',
'the farmer',
'the fisher',
'the police']
print(dt_mat.todense())
[[0 1 0 0 0 1 0 1 0 0]
[1 0 0 1 0 0 0 0 1 0]
[0 0 1 0 1 0 1 0 0 1]]
Great, so there you can see the features extracted by the CountVectorizer and the matrix representation of which features exist in each document. dt_mat is the document-term matrix and represents the count of each gram (the frequency) in the vocabulary (the features) for each document. To map this back to the grams, and even place it into your DataFrame, you can do the following:
df['grams'] = cv.inverse_transform(dt_mat)
print(df)
id text \
0 1 The farmer plants grain
1 2 The fisher catches tuna
2 3 The police officer fights crime
grams
0 [plants grain, farmer plants, the farmer]
1 [catches tuna, fisher catches, the fisher]
2 [fights crime, officer fights, police officer,...
Personally, this feels more meaningful because you are fitting the CountVectorizer to the entire corpus rather than just a single document at a time. You can still extract the same information (the frequency and the grams), and this will be much faster as you scale up in documents.
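As a small follow-up of my own (not part of the answer above), the per-id frequencies the question was after can still be recovered from that single corpus-wide fit, continuing from the df, cv, and dt_mat defined in the snippet above:
import pandas as pd

# Rows are documents (indexed by id), columns are the bigram features,
# values are the counts of each bigram in that document.
freq_per_doc = pd.DataFrame(dt_mat.todense(),
                            columns=cv.get_feature_names(),
                            index=df['id'])
print(freq_per_doc)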

extracting part of text from file in python

I have a collection of text files that are of the form:
Sponsor : U of NC Charlotte
U N C C Station
Charlotte, NC 28223 704/597-2000
NSF Program : 1468 MANUFACTURING MACHINES & EQUIP
Fld Applictn: 0308000 Industrial Technology
56 Engineering-Mechanical
Program Ref : 9146,MANU,
Abstract :
9500390 Patterson This award supports a new concept in precision metrology,
the Extreme Ultraviolet Optics Measuring Machine (EUVOMM). The goals for this
system when used to measure optical surfaces are a diameter range of 250 mm
with a lateral accuracy of 3.3 nm rms, and a depth range of 7.5 mm w
There's more text above and below the snippet. I want to be able to do the following for each text file:
store the NSF Program and Fld Applictn numbers in one list, and store the associated text in another list
so, in the above example I want the following, for the i-th text file:
y_num[i] = 1468, 0308000, 56
y_txt[i] = MANUFACTURING MACHINES & EQUIP, Industrial Technology, Engineering-Mechanical
Is there a clean way to do this in Python? I prefer Python since I am using os.walk to parse all the text files stored in subdirectories.
file = open( "file","r")
for line in file.readlines():
if "NSF" in line:
values= line.split(":")
elif "Fld" in line:
values1 = line.split(":")
So values and values1 hold the specific values you are interested in.
You can try something like
yourtextlist = yourtext.split(':')
numbers = []
for slice in yourtextlist:
    l = slice.split()
    try:
        numbers.append(int(l[0]))
    except ValueError:
        pass
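Neither snippet produces the y_num and y_txt lists from the question directly, so here is a hedged sketch of one way to do that. The layout assumptions are mine: code/label pairs sit on lines starting with "NSF Program" or "Fld Applictn", and indented continuation lines without a colon belong to the same block.
import re

# Matches a leading numeric code followed by its label,
# e.g. "1468 MANUFACTURING MACHINES & EQUIP".
ENTRY = re.compile(r'^\s*(\d+)\s+(.*\S)')

def parse_file(path):
    nums, txts = [], []
    in_block = False
    with open(path) as fh:
        for line in fh:
            if line.startswith(("NSF Program", "Fld Applictn")):
                in_block = True
                line = line.split(":", 1)[1]
            elif in_block and line.startswith(" ") and ":" not in line:
                pass  # indented continuation line, e.g. "56 Engineering-Mechanical"
            else:
                in_block = False
                continue
            m = ENTRY.match(line)
            if m:
                nums.append(m.group(1))  # kept as strings so "0308000" keeps its leading zero
                txts.append(m.group(2))
    return nums, txts

# Usage with os.walk, collecting one pair of lists per file:
#   y_num, y_txt = [], []
#   for root, _dirs, files in os.walk(top_dir):
#       for name in files:
#           nums, txts = parse_file(os.path.join(root, name))
#           y_num.append(nums)
#           y_txt.append(txts)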
