I have run into a strange problem. I have a mostly (but not purely) textual .txt file of 55.1 MB (55,082,716 bytes) that contains HTML-like markup like this:
<div id="lesson-archive" class="container"><div id="primary" class="content-area match-height"><div class="lesson-excerpt content-container"><article><p>We hear about climate change pretty much every day now. We see pictures of floods, fires and heatwaves on TV news. Scientists have just announced that July was the hottest month ever recorded. The scientists are from the National Oceanic and Atmospheric Administration (NOAA) in the USA. A spokesperson from NOAA said: "July is typically the world's warmest month of the year, but July 2021 outdid itself as the hottest July and hottest month ever." NOAA said Earth's land and ocean surface temperature in July was 0.93 degree Celsius higher than the 20th-century average of 15.8 degrees Celsius. The Northern Hemisphere was 1.54 degrees Celsius hotter than average.<br><br>
I would like to remove some characters with this regex: [^a-zA-Z.,!?-—() ]
Here is my code to solve this problem:
import re
with open('data.txt', 'a+') as f:
    data = f.read()
    edited_data = re.sub(r'[^a-zA-Z.,!?-—() ]', '', data)
    f.write(edited_data)
And that causes an error:
io.UnsupportedOperation: not readable
There are existing questions about a similar problem, but not with a+ mode. Why did I get this error?
I am using Python 3.8 on Ubuntu 20.04.
You cannot read the file with read() in that mode; reading is only available when the file is opened in a readable mode such as the default 'r'. You might want to do something like this:
import re
with open('data.txt') as f:
    data = f.read()
with open('data.txt', 'w') as f:
    edited_data = re.sub(r'[^a-zA-Z.,!?-—() ]', '', data)
    f.write(edited_data)
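As an aside, if you would rather rewrite the file in place through a single handle, 'r+' mode allows both reading and writing. A minimal sketch (with a small stand-in file, and with the hyphen in the character class escaped here, since an unescaped ?-— is interpreted as a character range):

```python
import re

# create a small sample file as a stand-in for the real data.txt
with open('data.txt', 'w') as f:
    f.write('Hello, world 123!\n<br>')

with open('data.txt', 'r+') as f:   # 'r+' opens for reading *and* writing
    data = f.read()
    f.seek(0)                        # rewind before writing over the old content
    f.write(re.sub(r'[^a-zA-Z.,!?\-—() ]', '', data))
    f.truncate()                     # cut off leftover bytes from the longer original

with open('data.txt') as f:
    print(f.read())                  # -> Hello, world !br
```

The truncate() call matters: the filtered text is shorter than the original, so without it the tail of the old content would survive at the end of the file.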
I have the following text file as input
Patient Name: XXX,A
Date of Service: 12/12/2018
Speaker ID: 10531
Visit Start: 06/07/2018
Visit End: 06/18/2018
Recipient:
REQUESTING PHYSICIAN:
Mr.XXX
REASON FOR CONSULTATION:
Acute asthma.
HISTORY OF PRESENT ILLNESS:
The patient is a 64-year-old female who is well known to our practice. She has not been feeling well over the last 3 weeks and has been complaining of increasing shortness of breath, cough, wheezing, and chest tightness. She was prescribed systemic steroids and Zithromax. Her respiratory symptoms persisted; and subsequently, she went to Capital Health Emergency Room. She presented to the office again yesterday with increasing shortness of breath, chest tightness, wheezing, and cough productive of thick sputum. She also noted some low-grade temperature.
PAST MEDICAL HISTORY:
Remarkable for bronchial asthma, peptic ulcer disease, hyperlipidemia, coronary artery disease with anomalous coronary artery, status post tonsillectomy, appendectomy, sinus surgery, and status post rotator cuff surgery.
HOME MEDICATIONS:
Include;
1. Armodafinil.
2. Atorvastatin.
3. Bisoprolol.
4. Symbicort.
5. Prolia.
6. Nexium.
7. Gabapentin.
8. Synthroid.
9. Linzess_____.
10. Montelukast.
11. Domperidone.
12. Tramadol.
ALLERGIES:
1. CEPHALOSPORIN.
2. PENICILLIN.
3. SULFA.
SOCIAL HISTORY:
She is a lifelong nonsmoker.
PHYSICAL EXAMINATION:
GENERAL: Shows a pleasant 64-year-old female.
VITAL SIGNS: Blood pressure 108/56, pulse of 70, respiratory rate is 26, and pulse oximetry is 94% on room air. She is afebrile.
HEENT: Conjunctivae are pink. Oral cavity is clear.
CHEST: Shows increased AP diameter and decreased breath sounds with diffuse inspiratory and expiratory wheeze and prolonged expiratory phase.
CARDIOVASCULAR: Regular rate and rhythm.
ABDOMEN: Soft.
EXTREMITIES: Does not show any edema.
LABORATORY DATA:
Her INR is 1.1. Chemistry; sodium 139, potassium 3.3, chloride 106, CO2 of 25, BUN is 10, creatinine 0.74, and glucose is 110. BNP is 40. White count on admission 16,800; hemoglobin 12.5; and neutrophils 88%. Two sets of blood cultures are negative. CT scan of the chest is obtained, which is consistent with tree-in-bud opacities of the lung involving bilateral lower lobes with patchy infiltrate involving the right upper lobe. There is mild bilateral bronchial wall thickening.
IMPRESSION:
1. Acute asthma.
2. Community acquired pneumonia.
3. Probable allergic bronchopulmonary aspergillosis.
I want the text file to be converted into an Excel file, like this:
Patient Name Date of Service Speaker ID Visit Start Visit End Recipient ..... IMPRESSION:
XYZ 2/27/2018 10101 06-07-2018 06/18/2018 NA ....... 1. Acute asthma.
2. Community
acquired
pneumonia.
3. Probable
allergic
I wrote the following code
from collections import OrderedDict
from csv import DictWriter

with open('1.txt') as infile:
    registrations = []
    fields = OrderedDict()
    d = {}
    for line in infile:
        line = line.strip()
        if line:
            key, value = [s.strip() for s in line.split(':', 1)]
            d[key] = value
            fields[key] = None
        else:
            if d:
                registrations.append(d)
                d = {}
    else:
        if d:  # handle EOF
            registrations.append(d)

with open('registrations.csv', 'w') as outfile:
    writer = DictWriter(outfile, fieldnames=fields)
    writer.writeheader()
    writer.writerows(registrations)
I'm getting an error
ValueError: not enough values to unpack (expected 2, got 1)
I'm not sure what the error is saying. I searched various websites but could not find a solution. When I edited the file by hand to remove the offending lines, the code worked. But in a real scenario there will be hundreds of thousands of files, so manually editing every file is not possible.
Your particular error is likely from
key, value = [s.strip() for s in line.split(':', 1)]
Some of your lines don't have a colon, so the split produces only one value, and Python can't unpack a single value into the pair key, value.
For example:
line = 'this is some text with a : colon'
key, value = [s.strip() for s in line.split(':', 1)]
print(key)
print(value)
returns:
this is some text with a
colon
But you'll get your error with
line = 'this is some text without a colon'
key, value = [s.strip() for s in line.split(':', 1)]
print(key)
print(value)
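One defensive way around this is str.partition(), which always returns a 3-tuple and never raises. A sketch using lines taken from the question's sample file:

```python
# str.partition always returns (head, separator, tail), even when the
# separator is absent, so colon-less lines can be detected and skipped
for line in ['Speaker ID: 10531', 'Mr.XXX', 'REASON FOR CONSULTATION:']:
    key, sep, value = line.partition(':')
    if sep:                       # the line really contained a colon
        print(key.strip(), '->', value.strip())
    else:                         # no colon: a continuation line, handle specially
        print('skipping:', line)
```

This keeps the rest of the parsing loop unchanged while making it robust against lines such as 'Mr.XXX'.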
Write a program that reads two country data files, worldpop.txt and worldarea.txt. Both files contain the same countries in the same order. Write a file density.txt that contains country names and population densities (people per square km).
worldpop.txt:
China 1415045928
India 1354051854
U.S. 326766748
Indonesia 266794980
Brazil 210867954
Pakistan 200813818
Nigeria 195875237
Bangladesh 166368149
Russia 143964709
Mexico 130759074
Japan 127185332
Ethiopia 107534882
Philippines 106512074
Egypt 99375741
Viet-Nam 96491146
DR-Congo 84004989
Germany 82293457
Iran 82011735
Turkey 81916871
Thailand 69183173
U.K. 66573504
France 65233271
Italy 59290969
worldarea.txt:
China 9388211
India 2973190
U.S. 9147420
Indonesia 1811570
Brazil 8358140
Pakistan 770880
Nigeria 910770
Bangladesh 130170
Russia 16376870
Mexico 1943950
Japan 364555
Ethiopia 1000000
Philippines 298170
Egypt 995450
Viet-Nam 310070
DR-Congo 2267050
Germany 348560
Iran 1628550
Turkey 769630
Thailand 510890
U.K. 241930
France 547557
Italy 294140
density.txt should be like this:
China 150.7258334947947
India 455.42055973550293
U.S. 35.72228540943785
Indonesia 147.27279652456156
Brazil 25.22905263611282
Pakistan 260.49945257368205
Nigeria 215.06553465748763
Bangladesh 1278.0836521471922
Russia 8.790734065789128
Mexico 67.26462820545795
Japan 348.8783091714556
Ethiopia 107.534882
Philippines 357.2192843009022
Egypt 99.8299673514491
Viet-Nam 311.1914922436869
DR-Congo 37.054757945347475
Germany 236.0955273123709
Iran 50.35874550980934
Turkey 106.43669165703
Thailand 135.41696451290883
U.K. 275.1767205389989
France 119.13512383185677
Italy 201.57397497790168
The program I wrote:
f = open('worldpop.txt', 'r')
f2 = open('worldarea.txt', 'r')
out = open('density.txt', 'w')
for line1 in f:  # finding country names
    pos1 = line1.find(' ')
    country = line1[0:pos1] + '\n'
for line2 in f:  # finding population numbers
    pos2 = line2.find(' ')
    population = line2[pos2+1:]
for line3 in f2:  # finding area numbers
    pos3 = line3.find(' ')
    area = line3[pos3+1:]
for line1 in f:  # writing density to a new file
    density = population / area
    out.write(density)
out.close()
f.close()
f2.close()
When I run the program, density.txt is empty. How can I fix this problem? Thanks.
Note: I know there are alternative solutions, but I mainly want to use this method, so please do not use other approaches.
Your code never enters the last loop, because the f iterator is already exhausted by that point (you can iterate over an iterator only once, and you already did so in your first loop).
Try to change your code in the way that you iterate each file only once.
By the way #1 - the same issue occurs in your second loop.
By the way #2 - you should convert the extracted population and area values to numeric types (float in your case) in order to divide them and compute the density.
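For instance, float() happily accepts the newline-terminated strings you get from the file, so the conversion is a one-liner (values here are China's row from the sample files):

```python
population = "1415045928\n"   # what you actually get from the file: a string with a newline
area = "9388211\n"

# float() tolerates surrounding whitespace, so the trailing '\n' is harmless
density = float(population) / float(area)
print(density)   # -> 150.7258334947947, matching the expected density.txt
```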
Files are streams. They keep a "pointer" to where in the stream they are. Once you have read to the end, you cannot read again unless you close/reopen the stream or seek() to an earlier position => iterating over f a second time does not work.
You don't need that anyway, as your countries occur in the same order in both files. You can simply read one line from the population file, strip the trailing newline, read one line from the area file and extract the area, then join the two, add a newline, and write that to your output.
Example:
with open('worldpop.txt', 'w') as r:
    r.write("""China 9388211
India 2973190
U.S. 9147420""")

with open('worldarea.txt', 'w') as f2:
    f2.write("""China 150.7258334947947
India 455.42055973550293
U.S. 35.72228540943785""")

with open('worldpop.txt') as r, open('worldarea.txt') as f2, open('d.txt', 'w') as out:
    for line1 in r:  # finding country names
        l = line1.strip()
        area = next(f2).split()[-1]  # next(.) reads one line from that file
        out.write(f"{l} {area}\n")

with open("d.txt") as d:
    print(d.read())
Output of combined file:
China 9388211 150.7258334947947
India 2973190 455.42055973550293
U.S. 9147420 35.72228540943785
If you want to do calculations, you need to convert the number strings into numbers:
with open('worldpop.txt', 'r') as r, open(
        'worldarea.txt', 'r') as f2, open('density.txt', 'w') as out:
    for line1 in r:  # finding country names
        l = line1.strip()
        ppl = float(l.strip().split()[-1])  # only use the number, convert to float
        area = float(next(f2).split()[-1])  # only use the number, convert to float
        # how many ppl per 1 area?
        d = ppl / area
        out.write(f"{l} {area} {d:f}\n")

with open("density.txt") as d:
    print(d.read())
This gives how many people live per unit of area:
China 9388211 150.7258334947947 62286.674967
India 2973190 455.42055973550293 6528.449224
U.S. 9147420 35.72228540943785 256070.402416
Here is your functional and tested code:
pop = open('worldpop.txt', 'r')  # 'r' is for read
area = open('worldarea.txt', 'r')
out = open('density.txt', 'w')

poplist = pop.readlines()  # reads the file and puts each line into a list (one element per line)
arealist = area.readlines()

for i in range(len(poplist)):
    # slice each line into a list, so we get a list of lists
    poplist[i] = poplist[i].split(" ")
    poplist[i][1] = poplist[i][1].replace("\n", "")
for i in range(len(arealist)):
    arealist[i] = arealist[i].split(" ")
    arealist[i][1] = arealist[i][1].replace("\n", "")

# two indexes: we're in a list of lists
for i in range(len(poplist)):
    out.write(poplist[i][0] + " ")
    out.write(str(int(poplist[i][1]) / int(arealist[i][1])) + "\n")

out.close()
pop.close()
area.close()
It works! Instead of your loops to read the lines, I just use readlines(), which makes everything much simpler. Then I just have to clean up the lists and build the output.
You can try this
Define the file names as variables:
WORLD_POP_PATH = 'worldpop.txt'
WORLD_AREA_PATH = 'worldarea.txt'
OUTPUT_PATH = 'density.txt'
Rather than relying on line position for each country, it is more robust to use a dict with the country name as the key.
def generate_dict(file_path):
    with open(file_path, 'r') as f_world_pop:
        split_line = lambda x: x.split()
        return {
            line[0]: line[1]
            for line in map(split_line, f_world_pop.readlines())
        }
We generate the world population and world area dicts from the contents of the files:
world_pop = generate_dict( WORLD_POP_PATH )
world_area = generate_dict( WORLD_AREA_PATH )
And we generate a density list containing the name and pop. density
density = [f'{country} {int(world_pop[country]) / int(world_area[country])}'
           for country in world_pop.keys()]
Finally, we write the result to the output file:
with open(OUTPUT_PATH, 'w+') as f_dens:
    f_dens.write('\n'.join(density))
You should prefer with over a bare open/close pair.
About with: What is the python keyword "with" used for?
Iterating over the file object with next() reads one line at a time instead of loading all the file contents into memory:
with open("./worldpop.txt") as f, open("./worldarea.txt") as f2, open("./density.txt", "w") as out:
    for line in f:
        country, pop = line.split(" ")
        _, area = next(f2).split(" ")
        out.write(f"{country} {int(pop)/int(area)}\n")
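The same lockstep idea can also be written with zip(), which pairs the two file iterators line by line so each file is still read exactly once. A sketch with hypothetical two-line sample files created up front:

```python
# set up two tiny sample files (same countries, same order, as in the question)
with open("worldpop.txt", "w") as f:
    f.write("China 1415045928\nIndia 1354051854\n")
with open("worldarea.txt", "w") as f:
    f.write("China 9388211\nIndia 2973190\n")

# zip() pairs the two file iterators line by line
with open("worldpop.txt") as fpop, open("worldarea.txt") as farea, \
        open("density.txt", "w") as out:
    for pop_line, area_line in zip(fpop, farea):
        country, pop = pop_line.split()
        _, area = area_line.split()
        out.write(f"{country} {int(pop) / int(area)}\n")

with open("density.txt") as f:
    print(f.read())
```

Using split() with no argument also sidesteps trailing-newline handling, since it strips surrounding whitespace for free.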
I have a collection of text files that are of the form:
Sponsor : U of NC Charlotte
U N C C Station
Charlotte, NC 28223 704/597-2000
NSF Program : 1468 MANUFACTURING MACHINES & EQUIP
Fld Applictn: 0308000 Industrial Technology
56 Engineering-Mechanical
Program Ref : 9146,MANU,
Abstract :
9500390 Patterson This award supports a new concept in precision metrology,
the Extreme Ultraviolet Optics Measuring Machine (EUVOMM). The goals for this
system when used to measure optical surfaces are a diameter range of 250 mm
with a lateral accuracy of 3.3 nm rms, and a depth range of 7.5 mm w
there's more text above and below the snippet. I want to be able to do the following, for each text file:
store the NSF program, and Fld Applictn numbers in a list, and store the associated text in another list
so, in the above example I want the following, for the i-th text file:
y_num[i] = 1468, 0308000, 56
y_txt[i] = MANUFACTURING MACHINES & EQUIP, Industrial Technology, Engineering-Mechanical
Is there a clean way to do this in python? I prefer python since I am using os.walk to parse all the text files stored in subdirectories.
file = open("file", "r")
for line in file.readlines():
    if "NSF" in line:
        values = line.split(":")
    elif "Fld" in line:
        values1 = line.split(":")
So values and values1 hold the specific values you are interested in.
You can try something like
yourtextlist = yourtext.split(':')
numbers = []
for slice in yourtextlist:
    l = slice.split()
    try:
        numbers.append(int(l[0]))
    except ValueError:
        pass
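Combining the two ideas, here is a sketch that splits off the label and then separates the leading number from the descriptive text with a regex, building the y_num/y_txt pair the question asks for. The sample lines are taken from the question's snippet; continuation lines without a colon (like "56 Engineering-Mechanical") would need extra handling:

```python
import re

# sample lines from the question's file format
lines = [
    "NSF Program : 1468 MANUFACTURING MACHINES & EQUIP",
    "Fld Applictn: 0308000 Industrial Technology",
]

y_num, y_txt = [], []
for line in lines:
    value = line.split(':', 1)[1].strip()    # drop the "NSF Program :" label
    m = re.match(r'(\d+)\s+(.*)', value)     # leading number, then the text
    if m:
        y_num.append(m.group(1))
        y_txt.append(m.group(2))

print(y_num)   # -> ['1468', '0308000']
print(y_txt)   # -> ['MANUFACTURING MACHINES & EQUIP', 'Industrial Technology']
```

Keeping the numbers as strings (rather than int()) preserves leading zeros such as the one in 0308000.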
I am trying to read a file formatted like below: there are two '\n' characters between every pair of lines.
Great tool for healing your life--if you are ready to change your beliefs!<br /><a href="http
Bought this book for a friend. I read it years ago and it is one of those books you keep forever. Love it!
I read this book many years ago and have heard Louise Hay speak a couple of times. It is a valuable read...
I am using below python code to read the line and convert it into Dataframe:
import pandas as pd

open_reviews = open("C:\\Downloads\\review_short.txt", "r", encoding="Latin-1").read()
documents = []
for r in open_reviews.split('\n\n'):
    documents.append(r)
df = pd.DataFrame(documents)
print(df.head())
The output I am getting is as below:
0 I was very inspired by Louise's Hay approach t...
1 \n You Can Heal Your Life by
2 \n I had an older version
3 \n I love Louise Hay and
4 \n I thought the book was exellent
Since I split on two '\n's, a leftover '\n' gets prepended to each line. Is there any other way to handle this, so that I get output as below:
0 I was very inspired by Louise's Hay approach t...
1 You Can Heal Your Life by
2 I had an older version
3 I love Louise Hay and
4 I thought the book was exellent
This appends every non-blank line.
filename = "..."
lines = []
with open(filename) as f:
    for line in f:
        line = line.strip()
        if line:
            lines.append(line)
>>> lines
['Great tool for healing your life--if you are ready to change your beliefs!<br /><a href="http',
'Bought this book for a friend. I read it years ago and it is one of those books you keep forever. Love it!',
'I read this book many years ago and have heard Louise Hay speak a couple of times. It is a valuable read...']
lines = pd.DataFrame(lines, columns=['my_text'])
>>> lines
my_text
0 Great tool for healing your life--if you are r...
1 Bought this book for a friend. I read it years...
2 I read this book many years ago and have heard...
Try using the .strip() method. It removes any unnecessary whitespace characters from the beginning and end of a string.
You can use it like this:
for r in open_reviews.split('\n\n'):
    documents.append(r.strip())
Use readlines() and clean each line with strip().
filename = "C:\\Downloads\\review_short.txt"
open_reviews = open(filename, "r", encoding="Latin-1")
documents = []
for r in open_reviews.readlines():
    r = r.strip()  # clean spaces and \n
    if r:
        documents.append(r)
Text File
• I.D.: AN000015544
DESCRIPTION: 6 1/2 DIGIT DIGITAL MULTIMETER
MANUFACTURER: HEWLETT-PACKARDMODEL NUM.: 34401A CALIBRATION - DUE DATE:6/1/2016 SERIAL NUMBER: MY45027398
• I.D.: AN000016955
DESCRIPTION: TEMPERATURE CALIBRATOR
MANUFACTURER: FLUKE MODEL NUM.: 724 CALIBRATION - DUE DATE:6/1/2016 SERIAL NUMBER: 1189063
• I.D.: AN000017259
DESCRIPTION: TRUE RMS MULTIMETER
MANUFACTURER: AGILENT MODEL NUM.: U1253A CALIBRATION - DUE DATE:6/1/2016 SERIAL NUMBER: MY49420076
Objective
To read the text file and save the ID number and Serial number of each part into the part_data data structure.
Data Structure
part_data = {'ID': [],
             'Serial Number': []}
Code
with open("textfile", 'r') as part_info:
    lineArray = part_info.read().split('\n')
    print(lineArray)
    if "• I.D.: AN000015544 " in lineArray:
        print("I have found the first line")
    ID = [s for s in lineArray if "AN" in s]
    print(ID[0])
My code isn't finding the I.D.: or the serial number value. I know it is wrong; I was trying to use the method I got from this website, Text File Reading and Printing Data, for parsing the data. Can anyone move me in the right direction for collecting the values?
Update
This solution works with Python 2.7.9 but not 3.4, thanks to domino - https://stackoverflow.com/users/209361/domino:
with open("textfile", 'r') as part_info:
    lineArray = part_info.readlines()
    for line in lineArray:
        if "I.D.:" in line:
            ID = line.split(':')[1]
            print ID.strip()
However, when I initially asked the question I was using Python 3.4, and the solution did not work properly.
Does anyone understand why it doesn't work in Python 3.4? Thank you!
This should print out all your ID's. I think it should move you in the right direction.
with open("textfile", 'r') as part_info:
    lineArray = part_info.readlines()
    for line in lineArray:
        if "I.D.:" in line:
            ID = line.split(':')[1]
            print ID.strip()
It won't work in Python 3 because in Python 3 print is a function. The last line should be:
print(ID.strip())
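For completeness, here is the same snippet in Python 3 form, with a small sample file (two lines from the question's text file) created first so it runs standalone:

```python
# create a sample "textfile" like the one in the question
with open("textfile", "w") as f:
    f.write("• I.D.: AN000015544\n"
            "DESCRIPTION: 6 1/2 DIGIT DIGITAL MULTIMETER\n")

# Python 3 version: print is a function, and iterating the file
# directly avoids loading all lines into memory at once
with open("textfile", "r") as part_info:
    for line in part_info:
        if "I.D.:" in line:
            ID = line.split(':')[1]
            print(ID.strip())   # -> AN000015544
```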