Downloading a file into memory - Python

I am writing a Python script and I just need the second line of a series of very small text files. I would like to extract this without saving each file to my hard drive, as I currently do.
I have found a few threads that reference the tempfile and StringIO modules, but I was unable to make much sense of them.
Currently I download all of the files and name them sequentially, like 1.txt, 2.txt, etc., then go through all of them and extract the second line. I would rather open a file, grab the line, and then move on to finding, opening, and reading the next file.
Here is what I currently do, writing to my hard drive:
while count4 <= num_files:
    file_p = [directory, str(count4), '.txt']
    file_path = ''.join(file_p)
    cand_summary = string.strip(linecache.getline(file_path, 2))
    linkFile = open('Summary.txt', 'a')
    linkFile.write(cand_summary)
    linkFile.write("\n")
    count4 = count4 + 1
linkFile.close()

Just replace the file writing with a call to append() on a list. For example:
summary = []
while count4 <= num_files:
    file_p = [directory, str(count4), '.txt']
    file_path = ''.join(file_p)
    cand_summary = string.strip(linecache.getline(file_path, 2))
    summary.append(cand_summary)
    count4 = count4 + 1
As an aside, you would normally write count4 += 1. Also, it looks like count4 uses 1-based indexing; that is unusual in Python.

You open and close the output file in every iteration.
Why not simply do:
with open("Summary.txt", "w") as linkFile:
    while count4 <= num_files:
        file_p = [directory, str(count4), '.txt']
        file_path = ''.join(file_p)
        cand_summary = linecache.getline(file_path, 2).strip()  # the string module is deprecated
        linkFile.write(cand_summary)
        linkFile.write("\n")
        count4 = count4 + 1
Also, linecache is probably not the right tool here, since it is optimized for reading multiple lines from the same file, not the same line from multiple files.
Instead, it is better to do:
with open(file_path, "r") as infile:
    dummy = infile.readline()
    cand_summary = infile.readline().strip()
Also, if you drop the strip() call, you don't have to re-add the \n; it is not clear why you strip it in the first place. Perhaps .lstrip() would be better?
Finally, what's with the manual while loop? Why not use a for loop?
Lastly, after your comment, I understand you want to put the result in a list instead of a file. OK.
All in all:
summary = []
for count in xrange(num_files):
    file_p = [directory, str(count), '.txt']  # or count + 1, if you start at 1
    file_path = ''.join(file_p)
    with open(file_path, "r") as infile:
        dummy = infile.readline()
        cand_summary = infile.readline().strip()
    summary.append(cand_summary)
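To come back to the "in memory" part of the question: whatever object holds the downloaded text can be read line by line without ever touching disk. A minimal sketch, using io.StringIO as a stand-in for a small file downloaded into memory (the helper name and sample text are made up for illustration):

```python
import io
from itertools import islice

def second_line(stream):
    # islice skips line 1 and yields line 2 without reading the rest;
    # the default "" covers files with fewer than two lines
    return next(islice(stream, 1, 2), "").strip()

# io.StringIO stands in for a small file downloaded into memory
downloaded = io.StringIO("header line\ncandidate summary\ntrailing line\n")
summary = second_line(downloaded)
```

The same function works unchanged on a real file object or on the response object returned by urllib, since all of them are line iterators.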

Related

Python Question - How to extract text between {textblock}{/textblock} of a .txt file?

I want to extract the text between {textblock_content} and {/textblock_content}.
With the script below, only the first line of the introtext.txt file is extracted and written to a newly created text file. I don't know why the script does not also extract the other lines of introtext.txt.
f = open("introtext.txt")
r = open("textcontent.txt", "w")
for l in f.readlines():
    if "{textblock_content}" in l:
        pos_text_begin = l.find("{textblock_content}") + 19
        pos_text_end = l.find("{/textblock_content}")
        text = l[pos_text_begin:pos_text_end]
        r.write(text)
f.close()
r.close()
How can I solve this problem?
Your code actually works fine, assuming each line contains both the begin and end markers. But I think this is not what you hoped for: you can't read multiple blocks on one line, and you can't read a block that starts and ends on different lines.
First of all, take a look at the object returned by the open function. You can use its read method to access the whole text at once. Also take a look at with statements; they help you work with files more easily and safely. To rewrite your code so that it reads everything between {textblock_content} and {/textblock_content}, we could write something like this:
def get_all_tags_content(
    text: str,
    tag_begin: str = "{textblock_content}",
    tag_end: str = "{/textblock_content}"
) -> list[str]:
    useful_text = text
    ans = []
    # Heavy loop, needs some optimization:
    # works in O(len(text) ** 2); we can do better
    while tag_begin in useful_text:
        useful_text = useful_text.split(tag_begin, 1)[1]
        if tag_end not in useful_text:
            break
        block_content, useful_text = useful_text.split(tag_end, 1)
        ans.append(block_content)
    return ans

with open("introtext.txt", "r") as f:
    with open("textcontent.txt", "w+") as r:
        r.write(str(get_all_tags_content(f.read())))
Try to write this function efficiently, so that it can work with really big files. In this implementation the remaining text is copied every time a content block appears; that is not necessary and it slows the program down. (Imagine millions of lines, each containing {textblock_content}"hello world"{/textblock_content}: on every line we would copy the whole remaining text to continue.) A plain for loop over the text can avoid the copying. Try to solve it yourself.
When you call file.readlines(), the file pointer reaches the end of the file. Further calls return an empty list, so if you change your code to something like one of the snippets below, it should work properly:
f = open("introtext.txt")
r = open("textcontent.txt", "w")
f_lines = f.readlines()
for l in f_lines:
    if "{textblock_content}" in l:
        pos_text_begin = l.find("{textblock_content}") + 19
        pos_text_end = l.find("{/textblock_content}")
        text = l[pos_text_begin:pos_text_end]
        r.write(text)
f.close()
r.close()
Also, you can implement it with context managers, as in the snippet below:
with open("textcontent.txt", "w") as r:
    with open("introtext.txt") as f:
        for line in f:
            if "{textblock_content}" in line:
                pos_text_begin = line.find("{textblock_content}") + 19
                pos_text_end = line.find("{/textblock_content}")
                text = line[pos_text_begin:pos_text_end]
                r.write(text)
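As an aside, a regular expression is another common way to pull out everything between the two markers; re.findall with a non-greedy group handles multiple blocks on one line, and re.DOTALL lets blocks span lines. A minimal sketch (the sample strings are made up):

```python
import re

def extract_blocks(text):
    # non-greedy (.*?) stops at the nearest closing tag;
    # re.DOTALL lets a block span several lines
    pattern = re.compile(r"\{textblock_content\}(.*?)\{/textblock_content\}", re.DOTALL)
    return pattern.findall(text)

sample = "{textblock_content}hello{/textblock_content} and {textblock_content}world{/textblock_content}"
blocks = extract_blocks(sample)
```

This also removes the hard-coded + 19 offset, which silently breaks if the tag name ever changes.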

object.write() is not working as expected

I am new to Python. I want to read one file and copy its data to another file. My code follows. In the version below, where I open the files inside the for loop, I can write all the data into dst_file, but it takes 8 seconds to write dst_file.
for cnt, hex_num in enumerate(hex_data):
    with open(src_file, "r") as src_f, open(dst_file, "a") as dst_f:
        copy_flag = False
        for src_line in src_f:
            if r"SPI_frame_0" in src_line:
                src_line = src_line.replace('SPI_frame_0', 'SPI_frame_' + str(cnt))
                copy_flag = True
            if r"halt" in src_line:
                copy_flag = False
            if copy_flag:
                copy_mid_data += src_line
        updated_data = WriteHexData(copy_mid_data, hex_num, cnt, msb_lsb_flag)
        copy_mid_data = ""
        dst_f.write(updated_data)
To improve performance, I tried opening the files outside the for loop, but then it does not work properly: it writes to dst_file only once (one iteration of the for loop), as shown below.
with open(src_file, "r") as src_f, open(dst_file, "a") as dst_f:
    for cnt, hex_num in enumerate(hex_data):
        copy_flag = False
        for src_line in src_f:
            if r"SPI_frame_0" in src_line:
                src_line = src_line.replace('SPI_frame_0', 'SPI_frame_' + str(cnt))
                copy_flag = True
            if r"halt" in src_line:
                copy_flag = False
            if copy_flag:
                copy_mid_data += src_line
        updated_data = WriteHexData(copy_mid_data, hex_num, cnt, msb_lsb_flag)
        copy_mid_data = ""
        dst_f.write(updated_data)
Can someone please help me find my mistake?
Files are iterators. Looping over them reads the file line by line, until you reach the end. They don't simply go back to the start when you try to read more; a new for loop over a file object does not 'reset' the file.
Either re-open the input file each time in the loop, seek back to the start explicitly, or read the file just once. You can seek back with src_f.seek(0); reopening means you need two with statements (one to open the output file once, the other inside the for loop to handle the src_f source file).
In this case, given that you build up the data to be written out in memory in one go anyway, I'd read the input file just once, keeping only the lines you need to copy.
You can use multiple for loops over the same file object; the file position will change accordingly. That makes it very simple to read a series of lines from a match on one key string to another. The itertools.takewhile() function makes it even easier:
from itertools import takewhile

# read the correct lines (from SPI_frame_0 to halt) from the source file
lines = []
with open(src_file, "r") as src_f:
    for line in src_f:
        if r"SPI_frame_0" in line:
            lines.append(line)
            # read additional lines until we find 'halt'
            lines += takewhile(lambda l: 'halt' not in l, src_f)

# transform the source lines with a new counter
with open(dst_file, "a") as dst_f:
    for cnt, hex_num in enumerate(hex_data):
        copy_mid_data = []
        for line in lines:
            if "SPI_frame_0" in line:
                line = line.replace('SPI_frame_0', 'SPI_frame_{}'.format(cnt))
            copy_mid_data.append(line)
        updated_data = WriteHexData(''.join(copy_mid_data), hex_num, cnt, msb_lsb_flag)
        dst_f.write(updated_data)
Note that I changed copy_mid_data to a list to avoid quadratic string copying; it is far more efficient to join a list of strings just once.
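The iterator exhaustion described above is easy to see in isolation. A minimal sketch, using io.StringIO as a stand-in for a real file object (both behave the same way here):

```python
import io

# io.StringIO behaves like an open text file for iteration purposes
f = io.StringIO("line 1\nline 2\n")
first_pass = list(f)    # consumes the stream up to EOF
second_pass = list(f)   # iterator is exhausted: nothing comes back
f.seek(0)               # rewind the position explicitly
third_pass = list(f)    # the lines are available again
```

This is exactly why the second version in the question writes only once: after the first inner for loop, src_f is at EOF for every later iteration.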

Write all lines for each set of a range to new file each time the range changes Python 3.6

I'm trying to find a way of making this process work Pythonically, or at all. Basically, I have a really long text file that is split into lines. Every x lines there is one that is mostly uppercase, which should roughly be the title of that particular section. Ideally, I want the title and everything after it to go into a text file that uses the title as its name. This would have to happen 3039 times in this case, as that is how many titles there are.
My process so far is this: I created a function that reads through a text file and tells me whether a line is mostly uppercase.
import numpy as np

def mostly_uppercase(text):
    threshold = 0.7
    isupper_bools = [character.isupper() for character in text]
    isupper_ints = [int(val) for val in isupper_bools]
    try:
        upper_percentage = np.mean(isupper_ints)
    except:
        return False
    if upper_percentage >= threshold:
        return True
    else:
        return False
Afterwards, I made a counter so that I could create an index and then I combined it:
counter = 0
headline_indices = []
for line in page_text:
    if mostly_uppercase(line):
        print(line)
        headline_indices.append(counter)
    counter += 1

headlines_with_articles = []
headline_indices_expanded = [0] + headline_indices + [len(page_text) - 1]
for first, second in zip(headline_indices_expanded, headline_indices_expanded[1:]):
    article_text = page_text[first:second]
    headlines_with_articles.append(article_text)
All of that seems to be working fine as far as I can tell. But when I try to print the pieces I want to files, all I manage to do is write the entire text into every one of the txt files.
for i in range(100):
    out_pathname = '/sharedfolder/temp_directory/' + 'new_file_' + str(i) + '.txt'
    with open(out_pathname, 'w') as fo:
        fo.write(articles_filtered[2])
Edit: This got me halfway there. Now, I just need a way of naming each file with the first line.
for i, text in enumerate(articles_filtered):
    open('/sharedfolder/temp_directory' + str(i + 1) + '.txt', 'w').write(str(text))
One conventional way of processing a single input file involves a Python with statement and a for loop, as follows. I have also adapted a good answer from someone else for counting uppercase characters, to get the fraction you need.
def mostly_upper(text):
    threshold = 0.7
    ## adapted from https://stackoverflow.com/a/18129868/131187
    upper_count = sum(1 for c in text if c.isupper())
    return len(text) > 0 and upper_count / len(text) >= threshold

first = True
out_file = None
with open('some_uppers.txt') as some_uppers:
    for line in some_uppers:
        line = line.rstrip()
        if first or mostly_upper(line):
            first = False
            if out_file: out_file.close()
            out_file = open(line + '.txt', 'w')
        print(line, file=out_file)
    out_file.close()
In the loop, we read each line and ask whether it is mostly uppercase. If it is, we close the file that was being used for the previous collection of lines and open a new file for the next collection, using the contents of the current line as its title.
I allow for the possibility that the first line might not be a title. In this case the code creates a file with the contents of the first line as its name, and proceeds to write everything it finds to that file until it does find a title line.
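The splitting logic above can be exercised on in-memory lines without touching the filesystem. A minimal sketch; the helper names and sample lines are simplified stand-ins for the ones in the answer:

```python
def is_title(line, threshold=0.7):
    # a line counts as a title when most of its characters are uppercase;
    # the length guard avoids dividing by zero on empty lines
    return len(line) > 0 and sum(1 for c in line if c.isupper()) / len(line) >= threshold

def split_sections(lines):
    # group each title line with the body lines that follow it
    sections = {}
    current = None
    for line in lines:
        if is_title(line) or current is None:
            current = line
            sections[current] = []
        else:
            sections[current].append(line)
    return sections

sample = ["FIRST HEADLINE", "body one", "SECOND HEADLINE", "body two", "body three"]
result = split_sections(sample)
```

Writing each value of the returned dict to a file named after its key then reproduces the one-file-per-title behaviour.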

For loop isn't working in python 3

I'm trying to save a file from a URL into a folder on my computer, but I have 732 URLs (each of which, when saved, gives experimental data) in a list. I'm trying to run a for loop over all those URLs to save each data set into its own file. This is what I'm doing right now:
for i in ExperimentURLs:
    myurl123 = str(i)
    myreq = urllib.request.urlopen(myurl123)
    mydata = myreq.read()
    with open('/Users/lauren/Desktop/IDData/file', 'wb') as ofile:
        ofile.write(mydata)
ExperimentURLs is my list of URLs, but I don't know how to make the for loop save each data set into a new file. Currently, this code only writes a single experiment's data into a file and stops there. If I try to save to a different file name, it takes a different experiment's data and saves that to the file. Help?
First, you need to automatically generate a new output file name every time through the loop. I'll give you the trivial version below. Also, note that the URLs are already strings; you don't have to convert them.
pos = 0
for myurl123 in ExperimentURLs:
    myreq = urllib.request.urlopen(myurl123)
    mydata = myreq.read()
    out_file = '/Users/lauren/Desktop/IDData/file' + str(pos)
    with open(out_file, 'wb') as ofile:
        ofile.write(mydata)
    pos += 1
Does that solve your problem?
BTW, you can do the two iterations in parallel with
for i, myurl123 in enumerate(ExperimentURLs):
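Putting the enumerate suggestion together, the whole loop can be sketched as below. To keep it testable without network access, the fetch step is injectable; the helper name save_all and the default urllib call are illustrative, not part of the original code:

```python
import urllib.request

def save_all(urls, out_dir, fetch=None):
    # fetch is injectable so the loop can be exercised without network access;
    # the default uses urllib.request, as in the question
    if fetch is None:
        fetch = lambda url: urllib.request.urlopen(url).read()
    paths = []
    for i, url in enumerate(urls):
        # enumerate supplies the counter, so no manual pos += 1 is needed
        out_path = "{}/file{}".format(out_dir, i)
        with open(out_path, "wb") as ofile:
            ofile.write(fetch(url))
        paths.append(out_path)
    return paths
```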
Your mistake is simply at the point of writing the files; it is not that the for loop is not working. You are writing to the same file again and again. Here is a modified version, using requests. All you need to do is change the file name when saving.
import requests

ExperimentURLs = [
    "https://www.google.com",
    "https://www.yahoo.com"
]

counter = 0
for i in ExperimentURLs:
    myurl123 = str(i)
    r = requests.get(myurl123)
    mydata = r.text.strip()
    fileName = counter
    with open("results/" + str(fileName) + ".html", 'w', encoding='utf-8') as ofile:
        ofile.write(mydata)
    counter += 1

Python read file in while loop

I am currently in some trouble regarding Python and reading files. I have to open a file in a while loop and do some stuff with the values in the file. The results are written into a new file. This new file is then read in the next run of the while loop. But in this second run I get no values out of this file... Here is a code snippet that hopefully clarifies what I mean.
while convergence == 0:
    run += 1
    prevrun = run - 1
    if os.path.isfile("./Output/temp/EmissionMat%d.txt" % prevrun):
        matfile = open("./Output/temp/EmissionMat%d.txt" % prevrun, "r")
        EmissionMat = Aux_Functions.EmissionMat(matfile)
        matfile.close()
    else:
        matfile = open("./Input/EmissionMat.txt", "r")
        EmissionMat = Aux_Functions.EmissionMat(matfile)
        matfile.close()
    # now some valid operations, which produce a matrix
    emissionmat_file = open("./output/temp/EmissionMat%d.txt" % run, "w")
    emissionmat_file.flush()
    emissionmat_file.write(str(matrix))
    emissionmat_file.close()
Solved it!
matfile.seek(0)
This resets the pointer to the beginning of the file and allows me to read the file correctly in the next run.
Why write to a file and then read it back? Moreover, you use flush, so you are potentially doing long I/O. I would do:
with open(originalpath) as f:
    mat = f.read()
while condition:
    run += 1
    write_mat_run(mat, run)
    mat = func(mat)
write_mat_run may be done in another thread. You should also check for I/O exceptions.
BTW, this will probably solve your bug, or at least make it clear.
I can see nothing wrong with your code. The following concrete example worked on my Linux machine:
import os

run = 0
while run < 10:
    run += 1
    prevrun = run - 1
    if os.path.isfile("output%d.txt" % prevrun):
        matfile = open("output%d.txt" % prevrun, "r")
        data = matfile.readlines()
        matfile.close()
    else:
        matfile = open("input.txt", "r")
        data = matfile.readlines()
        matfile.close()
    data = [s[:-1] + "!\n" for s in data]
    emissionmat_file = open("output%d.txt" % run, "w")
    emissionmat_file.writelines(data)
    emissionmat_file.close()
It adds an exclamation mark to each line in the file input.txt.
I solved it. Before closing the file, I do:
matfile.seek(0)
This solved my problem. This method sets the reader's pointer back to the beginning of the file.
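The seek(0) trick is easiest to see with a file opened in "w+" mode, which allows both writing and reading on the same handle. A minimal sketch (the sample data is made up):

```python
import tempfile

# "w+" opens the file for both writing and reading back
with tempfile.TemporaryFile("w+") as f:
    f.write("1 2 3\n4 5 6\n")
    f.seek(0)            # without this, read() returns "" (pointer is at EOF)
    contents = f.read()
```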
