write into file without reading all the file - python

I have an HTML template file with a header and a footer. I am trying to insert text right after <tbody>.
The way I'm doing it right now is with fileinput.input():
def write_to_html(self, path):
    for line in fileinput.input(path, inplace=1):
        line = re.sub(r'CURRENT_APPLICATION', obj, line)
        line = re.sub(r'IN_PROGRESS', time.strftime("%Y-%m-%d %H:%M:%S"), line)
        line = re.sub(r'CURRENT_VERSION', svers, line)
        print line,  # preserve old content
        if "<tbody>" in line:
            print ("<tr>")
            ###PRINT MY STUFFS
            print ("</tr>")
I call this for each table row I have to add to my HTML table, but I have around 5k rows to add (each row is about 30 lines of HTML code). It starts fast, but each row takes longer and longer to add. Is that because it has to rewrite the whole file for each row?
Is there a way to speed up the process?
EDIT: thanks for the responses.
I like the idea of building one big string and then going through the file just once.
I'll have to change some things, because the function I showed is in a class, and in my main program I just iterate over a folder containing .json files:
for json in jsonfolder:
    Object_a = CLASS-A(json)  # unserialization
    Object_a.write_to_html()  # the function I showed
I should turn that into:
block_of_lines = ''
for json in jsonfolder:
    Object_a = CLASS-A(json)  # unserialization
    block_of_lines += Object_a.to_html_string()
Create_html(block_of_lines)
Would that be faster?

Re-reading the question a couple more times, the following thought occurs.
Could you split the writing into 3 blocks - one for the header, one for the table lines and another for the footer. It does rather seem to depend on what those three substitution lines are doing, but if I'm right, they can only update lines the first time the template is used, ie. while acting on the first json file, and then remain unchanged for the others.
footer = CLASS-A.write_html_header(path)
for json in jsonfolder:
    Object_a = CLASS-A(json)  # unserialization
    Object_a.write_to_html(path)  # use the part of the function
                                  # that just handles the json file here
CLASS-A.write_html_footer(path, footer)
Then in your class, define the two new functions to write the header and footer as static methods (which means they can be used from the class rather than just on an instance)
i.e. (using a copy from your own code)
@staticmethod
def write_html_header(path):
    footer = []
    save_for_later = False
    for line in fileinput.input(path, inplace=1):
        line = re.sub(r'CURRENT_APPLICATION', obj, line)
        line = re.sub(r'IN_PROGRESS', time.strftime("%Y-%m-%d %H:%M:%S"), line)
        line = re.sub(r'CURRENT_VERSION', svers, line)
        # this block prints the header, and saves the
        # footer from your template.
        if save_for_later:
            footer.append(line)
        else:
            print line,  # preserve old content
            if "<tbody>" in line:
                save_for_later = True
    return footer
I do wonder why you're editing 'inplace'. Doesn't that mean the template gets overwritten, making it less of a template and more of a single-use form? Normally when I use a template, I read from the template and write an edited version out to a new file, so the template can be re-used time and time again.
For the footer section, open your file in append mode, and then write the lines in the footer array created by the call to the header writing function.
I do think not editing the template in place would be of benefit to you. Then you'd just need to:
open the template (read only)
open the new_file (in new, write mode)
write the header into new_file
loop over json files
    append table content into new_file
append the footer into new_file
That way you're never re-reading the bits of the file you created while looping over the json files. Nor are you trying to store the whole file in memory if that is a concern.
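The steps above can be sketched roughly as follows. This is only an illustration written for Python 3: the function name build_report, the file names, and the sample rows are made up, and it assumes the <tbody> marker appears exactly once in the template.

```python
import os
import tempfile


def build_report(template_path, output_path, row_blocks, marker="<tbody>"):
    """Copy the template to output_path, inserting row_blocks after marker."""
    with open(template_path) as template, open(output_path, "w") as out:
        for line in template:
            out.write(line)
            if marker in line:
                # The header is done: stream the table rows, then carry on
                # copying the rest of the template (the footer).
                for block in row_blocks:
                    out.write(block)


# Small demonstration with a throwaway template (hypothetical content).
workdir = tempfile.mkdtemp()
template_path = os.path.join(workdir, "template.html")
output_path = os.path.join(workdir, "report.html")
with open(template_path, "w") as f:
    f.write("<table>\n<tbody>\n</tbody>\n</table>\n")

# A generator stands in for "loop over json files and render each one".
rows = ("<tr><td>%d</td></tr>\n" % n for n in range(3))
build_report(template_path, output_path, rows)

with open(output_path) as f:
    result = f.read()
with open(template_path) as f:
    template_after = f.read()
print(result)
```

Each file is read or written exactly once, and the template itself is left untouched, so it can be reused for the next run.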

5000 lines is nothing. Read the entire file using f.readlines() to get a list of lines:
with open(path) as f:
lines = f.readlines()
Then process each line, and eventually join them to one string and write the entire thing back to the file.
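A minimal sketch of that idea, written for Python 3. The helper name insert_rows and the sample lines are made up for illustration; in real use the list would come from f.readlines() and the joined string would be written back with a single f.write call.

```python
def insert_rows(lines, rows, marker="<tbody>"):
    """Return a new list of lines with rows inserted after the marker line."""
    result = []
    for line in lines:
        result.append(line)
        if marker in line:
            result.extend(rows)
    return result


# Stand-ins for f.readlines() and for the rendered table rows.
template_lines = ["<table>\n", "<tbody>\n", "</tbody>\n", "</table>\n"]
new_rows = ["<tr><td>a</td></tr>\n", "<tr><td>b</td></tr>\n"]

# Join once at the end instead of rewriting the file for every row.
page = "".join(insert_rows(template_lines, new_rows))
print(page)
```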

Related

Read in Python code/regular expressions from external file

I have a section of Python (Sigil) code:
for (id, href) in bk.text_iter():
    html = bk.readfile(id)
    html = re.sub(r'<title></title>', '<title>Mara’s Tale</title>', html)
    html = re.sub(r'<p>Mara’s Tale</p>', '<p class="title">Mara’s Tale</p>', html)
bk.writefile(id, html)
Ideally, I'd like to read the regular expressions in from an external text-file (or just read in that block of code). Any suggestions? I've done similar in Perl with a try, but I'm a Python-novice.
Also, quick supplementary question - shouldn't bk.writefile be indented? And, if so, why is my code working? It looks as though it's outside the for block, and therefore will only write to the final file, if that (it's an epub, so there are several html files), but it's updating all relevant files.
Regarding bk, my understanding is that this object is the whole epub, and what this code is doing is reading each html file that makes up an epub via text_iter, so id is each individual file.
EDIT TO ADD
Ah! That bk.writefile should indeed be indented. I got away with it because, at the point I run this code, I only have a single html file.
As for reading something from a file, it's easy. Assume you have the file 'my_file.txt' in the same folder where the script is saved:
f = open('my_file.txt', 'r')
content = f.read()  # read all content of the file into the string 'content'
lines = content.splitlines()  # split that content into the list 'lines'
f.close()
print(lines[0])  # first line
print(lines[1])  # second line
# etc
As for whether bk.writefile should be indented: yes, as written the loop reassigns the variable html several times but only the last iteration gets saved. It looks odd, so it probably should be indented, but that's just a guess.
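One possible way to keep the substitutions in an external file is sketched below. The file format (one rule per line, pattern and replacement separated by a tab, with # for comments) is an assumption made for this example, not something Sigil prescribes, and the helper names parse_rules and apply_rules are hypothetical. The sample rules are modelled on the two substitutions in the question, with a plain apostrophe.

```python
import re


def parse_rules(text):
    """Parse 'pattern<TAB>replacement' lines into (compiled regex, replacement) pairs."""
    rules = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        pattern, replacement = line.split("\t", 1)
        rules.append((re.compile(pattern), replacement))
    return rules


def apply_rules(rules, html):
    """Apply every rule in order and return the rewritten text."""
    for regex, replacement in rules:
        html = regex.sub(replacement, html)
    return html


# In practice this text would come from a file, e.g.
# open('rules.txt', encoding='utf-8').read()  (hypothetical file name).
rules_text = (
    "# pattern<TAB>replacement\n"
    "<title></title>\t<title>Mara's Tale</title>\n"
    "<p>Mara's Tale</p>\t<p class=\"title\">Mara's Tale</p>\n"
)
rules = parse_rules(rules_text)
sample = "<title></title><p>Mara's Tale</p>"
converted = apply_rules(rules, sample)
print(converted)
```

Inside the Sigil loop, the two re.sub lines would then become a single html = apply_rules(rules, html) call. Note that replacement strings are still interpreted by re, so a backslash in the rules file would need escaping.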

How to get rid of '\n' as a part of dict value when assigning from text file

I wrote a function that fills in a JSON template with provided information, specifically an EAN that I read from a file. The EAN ends up with \n at the end and I'm not able to get rid of it.
I tried replace('\n', '') and value[:-3], but those only affect the number itself.
I tried adding the parameter newline="" or newline=None to the open call, but that only added \r between the number and \n.
eans.txt is simply a file containing EANs, each on a new line, without any gaps or tabs.
material_template = {
    "eanUpc": "",
}

def get_ean():
    with open('eans.txt', 'r') as x:
        first = x.readline()
        all = x.read()
    with open('eans.txt', 'w') as y:
        y.writelines(all[1:])
    return first

def make_material(material_template):
    material_template["eanUpc"] = get_ean()
    print(material_template)
    print(material_template["eanUpc"])

make_material(material_template)

Output:
{'eanUpc': '061500127178\n'}
061500127178
thanks in advance
Instead of return first, use return first.strip("\n") to trim off the \n.
However, I question the organization of the original system: reading one line, then writing the whole file back over the original.
What if the process fails while handling one line? The data is lost.
If this file is large, you'll be doing a lot of disk I/O.
Maybe read one line at a time and save an offset into the file (last_processed) in another file? Then you can seek to that location.
You could also use return first.strip(), which removes leading and trailing whitespace, including \r and \n.
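The offset idea mentioned above might look something like this in Python 3. It is only a sketch: the function name next_ean and the eans.offset side file are invented for the example, and the data file is never rewritten.

```python
import os
import tempfile


def next_ean(data_path, offset_path):
    """Return the next unread EAN (without its newline), or None when exhausted."""
    offset = 0
    if os.path.exists(offset_path):
        with open(offset_path) as f:
            offset = int(f.read() or 0)

    # Binary mode keeps the offsets plain byte positions.
    with open(data_path, "rb") as f:
        f.seek(offset)
        raw = f.readline()
        new_offset = f.tell()

    if not raw:
        return None  # nothing left to read

    # Record progress only after the line was read successfully.
    with open(offset_path, "w") as f:
        f.write(str(new_offset))
    return raw.decode("ascii").strip()


# Demonstration with a throwaway file containing two EANs.
workdir = tempfile.mkdtemp()
data_path = os.path.join(workdir, "eans.txt")
offset_path = os.path.join(workdir, "eans.offset")
with open(data_path, "w") as f:
    f.write("061500127178\n061500127179\n")

eans = [next_ean(data_path, offset_path) for _ in range(3)]
print(eans)
```

Because only the small offset file changes between calls, a crash in the middle of processing cannot lose the remaining EANs.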

How to take items in a list, and print them to a new .txt document on separate lines

Okay, this may have been discussed before, but I am unable to find it anywhere on Stack, so here I am.
Basically, I am writing a script that will take a .txt document, pick out every other line (the even lines, say), and write them into a new text document.
I was able to write code that scans the text, takes the even-numbered lines, and puts them into a list as separate items, but when I go to add each item of the list to the new text document, depending on where I do that, I get either the first line or the last line, but never more than one.
Here is what I have:
f = open('stuffs.txt', 'r')
i = 1
x = []
for line in f.readlines():
    if i % 2 == 0:
        x.append(line)
    i += 1
I have tested that this successfully takes the proper lines and stores them in list x.
I have tried
for m in x:
    t = open('stuffs2.txt','w')
    t.write(m)
directly after, and it only writes the last line.
If I do
for line in f.readlines():
    if i % 2 == 0:
        t = open('stuffs2.txt','w')
        t.write(line)
    i += 1
it writes only the first line.
If I try to add the first solution to the for loop as a nested for loop, it also writes only the first line. I have no idea why it is not taking each individual item and putting it in the .txt file.
When I print the list, everything is in there as it should be.
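I looked for a canonical duplicate and did not find one.
open('stuffs2.txt','w'): mode 'w' discards whatever is already in the file and opens a new, empty one, so every open() call inside your loop wipes out what was written before.
Read the documentation on reading and writing files: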
7.2. Reading and Writing Files
open() returns a file object, and is most commonly used with two arguments: open(filename, mode).
f = open('workfile', 'w')
The first argument is a string containing the filename. The second argument is another string containing a few characters describing the way in which the file will be used. mode can be 'r' when the file will only be read, 'w' for only writing (an existing file with the same name will be erased), and 'a' opens the file for appending; any data written to the file is automatically added to the end. 'r+' opens the file for both reading and writing. The mode argument is optional; 'r' will be assumed if it’s omitted.
To write every 2nd line more economically:
with open("file.txt") as f, open("target.txt","w") as t:
    write = True  # start with False to keep the even-numbered lines instead
    for line in f:
        if write:
            t.write(line)
        write = not write
This way you do not need to store all lines in memory.
The with open(...) as name: syntax is also better: it will close your file handle (which you do not do) even if exceptions arise.
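If you would rather keep the list from the question, the essential change is to open the target file once, outside the loop, so that mode 'w' truncates it only once. A small sketch for Python 3, using temporary files and made-up sample content in place of stuffs.txt and stuffs2.txt:

```python
import os
import tempfile

# Throwaway stand-ins for stuffs.txt and stuffs2.txt.
workdir = tempfile.mkdtemp()
source_path = os.path.join(workdir, "stuffs.txt")
target_path = os.path.join(workdir, "stuffs2.txt")
with open(source_path, "w") as f:
    f.write("one\ntwo\nthree\nfour\nfive\n")

# Collect the even-numbered lines (2nd, 4th, ...), as in the question.
with open(source_path) as f:
    x = [line for number, line in enumerate(f, start=1) if number % 2 == 0]

# Open the target once, outside any loop, then write every item.
with open(target_path, "w") as t:
    t.writelines(x)

with open(target_path) as t:
    written = t.read()
print(written)
```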

Python reading nothing from file [duplicate]

I am a beginner in Python. I am trying to figure out why the second 'for' loop doesn't work in the following script. I only get the results of the first 'for' loop, and nothing from the second one. I have pasted my script and the CSV data below.
It would be helpful if you could tell me why it behaves this way and how to make the second 'for' loop work as well.
My SCRIPT:
import csv
file = "data.csv"
fh = open(file, 'rb')
read = csv.DictReader(fh)
for e in read:
    print(e['a'])
for e in read:
    print(e['b'])
"data.csv":
a,b,c
tree,bough,trunk
animal,leg,trunk
fish,fin,body
The csv reader is an iterator over the file. Once you go through it once, you read to the end of the file, so there is no more to read. If you need to go through it again, you can seek to the beginning of the file:
fh.seek(0)
This will reset the file to the beginning so you can read it again. Depending on the code, it may also be necessary to skip the field name header:
next(fh)
This is necessary for your code, since the DictReader consumed that line the first time around to determine the field names, and it's not going to do that again. It may not be necessary for other uses of csv.
If the file isn't too big and you need to do several things with the data, you could also just read the whole thing into a list:
data = list(read)
Then you can do what you want with data.
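Both options can be tried out without a file on disk. The sketch below is written for Python 3 and uses io.StringIO as a stand-in for the open data.csv; the variable names are just for illustration.

```python
import csv
import io

# Stand-in for the open file from the question.
fh = io.StringIO("a,b,c\ntree,bough,trunk\nanimal,leg,trunk\nfish,fin,body\n")
reader = csv.DictReader(fh)

first_pass = [row["a"] for row in reader]
exhausted = [row["b"] for row in reader]  # empty: the reader is already at the end

# Option 1: rewind, skip the header line, then iterate again.
fh.seek(0)
next(fh)
second_pass = [row["b"] for row in reader]

# Option 2: read everything into a list once and reuse it freely.
fh.seek(0)
rows = list(csv.DictReader(fh))
column_c = [row["c"] for row in rows]

print(first_pass, exhausted, second_pass, column_c)
```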
I have written a small function that takes the path of a CSV file, reads it, and returns a list of dicts in one go, so you can then loop over the list as often as you like:
import csv

def read_csv_data(path):
    """
    Read the CSV file at the given path and return a list of dicts
    mapping each column name to that row's value.
    """
    with open(path) as csv_file:
        data = csv.reader(csv_file)
        # Read the column names from the first line of the file
        fields = next(data)
        data_lines = []
        for row in data:
            items = dict(zip(fields, row))
            data_lines.append(items)
    return data_lines
Regards

Adding brackets and commas to multiple JSON objects

I've created a very simple piece of code to read in tweets in JSON format in text files, determine if they contain an id and coordinates and if so, write these attributes to a csv file. This is the code:
f = csv.writer(open('GeotaggedTweets/ListOfTweets.csv', 'wb+'))
all_files = glob.glob('SampleTweets/*.txt')
for filename in all_files:
    with open(filename, 'r') as file:
        data = simplejson.load(file)
        if 'text' and 'coordinates' in data:
            f.writerow([data['id'], data['geo']['coordinates']])
I've been having some difficulties but with the help of the excellent JSON Lint website have realised my mistake. I have multiple JSON objects and from what I read these need to be separated by commas and have square brackets added to the start and end of the file.
How can I achieve this? I've seen some examples online where each individual line is read and it's added to the first and last line, but as I load the whole file I'm not entirely sure how to do this.
You have a file that either contains too many newlines (in the JSON values themselves) or too few (no newlines between the tweets at all).
You can still repair this by using some creative re-stitching. The following generator function should do it:
import json

def read_objects(filename):
    decoder = json.JSONDecoder()

    with open(filename, 'r') as inputfile:
        # next(..., '') ends the loop cleanly at end of file
        line = next(inputfile, '').strip()
        while line:
            try:
                obj, index = decoder.raw_decode(line)
                yield obj
                line = line[index:]
            except ValueError:
                # Assume we didn't have a complete object yet
                line += next(inputfile).strip()
            if not line:
                line += next(inputfile, '').strip()
This should be able to read all your JSON objects in sequence:
for filename in all_files:
    for data in read_objects(filename):
        # 'text' and 'coordinates' in data would only test for 'coordinates'
        if 'text' in data and 'coordinates' in data:
            f.writerow([data['id'], data['geo']['coordinates']])
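If the files are small enough to hold in memory, an alternative is to read the whole text and walk through it with raw_decode(text, position), which avoids stitching lines together. This is a sketch for Python 3; the function name iter_json_objects and the sample data are made up, and it assumes the objects are separated only by whitespace (or nothing), not by commas.

```python
import json


def iter_json_objects(text):
    """Yield each JSON value found in text, skipping whitespace between them."""
    decoder = json.JSONDecoder()
    position = 0
    while position < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        if text[position].isspace():
            position += 1
            continue
        obj, position = decoder.raw_decode(text, position)
        yield obj


# Objects glued together, split across lines, and separated by blank lines.
sample = '{"id": 1, "text": "a"}{"id": 2,\n "text": "b"}\n\n{"id": 3}'
objects = list(iter_json_objects(sample))
print(objects)
```

Malformed or truncated input raises json.JSONDecodeError (a ValueError subclass) instead of being skipped silently.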
It is otherwise fine to have multiple JSON strings written to one file, but you need to make sure that the entries are clearly separated somehow. Writing JSON entries that do not use newlines, then using newlines in between them, for example, makes sure you can later on read them one by one again and process them sequentially without this much hassle.
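As a small illustration of that newline-delimited layout (Python 3, with made-up tweets and an in-memory buffer standing in for the file): json.dumps escapes any newline inside a string value, so each object occupies exactly one line and can be read back with one json.loads call per line.

```python
import io
import json

# Hypothetical sample data; the first text contains a real newline character.
tweets = [
    {"id": 1, "text": "first\nline break inside"},
    {"id": 2, "text": "second"},
]

# Writing: one JSON object per line.
buffer = io.StringIO()  # stand-in for a file opened with open(path, 'w')
for tweet in tweets:
    buffer.write(json.dumps(tweet) + "\n")

# Reading back: one json.loads call per non-empty line.
buffer.seek(0)
restored = [json.loads(line) for line in buffer if line.strip()]
print(restored == tweets)
```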
