How to write to a file in different places? - python

I want to:
open a file
add 4 underscore characters to the beginning of each line
find the blank lines
replace the newline character in the blank lines with 50 underscore characters
add new lines before and after the 50 underscore characters
I found many similar questions on Stack Overflow, but I could not combine all these operations without getting errors. See my previous question here. Is there a simple beginner's way to accomplish this so that I can start from there? (I don't mind writing to the same file; there is no need to open two files.) Thanks.

You're going to have to pick:
Use two files, but never have to store more than 1 line in memory at a time
or
Build the new file in memory as you read the original, then overwrite the original with the new
A file isn't a flexible memory structure. You can't replace the one or two characters of a newline with 50 underscores; a file simply doesn't work like that. If you are sure the new file is going to be a manageable size and you don't mind writing over the original, you can do it without a new file.
Myself, I would always allow the user to opt for an output file. What if something goes wrong? Disk space is super cheap.
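A minimal sketch of the first option, streaming between two files so that only one line is ever in memory (file names are just examples):
with open("input.txt") as src, open("output.txt", "w") as dst:
    for line in src:
        if line.strip():                       # non-blank line
            dst.write("____" + line)           # add 4 leading underscores
        else:                                  # blank line
            dst.write("\n" + "_" * 50 + "\n")  # 50 underscores, newlines before and after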

You can do everything you want by reading the file first, performing the changes on the lines in memory, and finally writing everything back. If the file doesn't fit in memory, you should read it in batches and write to a temporary file. You can't make these changes in situ.
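If the file does fit in memory, a sketch of the read-then-rewrite approach might look like this (the file name is again an example):
with open("input.txt") as f:
    lines = f.readlines()

new_lines = []
for line in lines:
    if line.strip():                           # non-blank line
        new_lines.append("____" + line)
    else:                                      # blank line
        new_lines.append("\n" + "_" * 50 + "\n")

with open("input.txt", "w") as f:              # overwrites the original
    f.writelines(new_lines)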

Related

Search for a word, and modify the whole line in Python text processing

This is my carDatabase.txt
CarID:c01 ModelName:honda VehicleType:city Price:20
CarID:c02 ModelName:honda VehicleType:x Price:30
I want to search for the CarID and modify only that whole line, without disturbing the other lines
my current code is here:
# Converting txt data into a string and modify
carsDatabaseFile = open('carsDatabase.txt', 'r')
allDataFromDatabase = [line.split(',') for line in carsDatabaseFile.readlines()]
Note:
Your question has a couple of issues: your sample from carDatabase.txt looks like it is tab-delimited, but your current code splits each line around the ',' character. This also looks like a place where a list comprehension might be hurting you more than it is helping. Break it up into a for loop if you're trying to add logic that manipulates a single line.
For looking at CSV files, I would highly recommend using pandas for general manipulation of data in comma-separated as well as a number of other formats.
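For instance, a minimal sketch (assuming the file really is tab- or whitespace-delimited; further column handling is left out):
import pandas as pd

# Each "Key:value" token lands in its own column; split on ':' afterwards.
cars = pd.read_csv("carsDatabase.txt", sep=r"\s+", header=None)
print(cars.head())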
That said, if you are truly restricted to built-in packages, or you are looking at this as a learning exercise, and your goal is to directly manipulate just one line of that file, what you are looking for is the seek method. You can use it in combination with the tell method (documented just below seek in the link above) to find where you are in the file.
Write a for loop to identify which line in the file you are looking for
From there, you can get the output of tell() to find the specific place in the file you are trying to manipulate
Using the output from the above two steps, you can set the file pointer to a specific location using the seek() method (by byte: a file is really stored as a one-dimensional sequence of bytes).
You can now use the write() method to directly update the file at the location you determined above.
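Putting those steps together, a sketch might look like this, using the sample data from the question. Note that overwriting in place only works cleanly when the replacement line has exactly the same length as the original:
# Find the line for a given CarID, then overwrite it in place.
with open("carsDatabase.txt", "r+") as db:
    while True:
        position = db.tell()   # offset where the next line starts
        line = db.readline()
        if not line:           # end of file; CarID not found
            break
        if "CarID:c02" in line:
            db.seek(position)  # jump back to the start of that line
            db.write("CarID:c02 ModelName:honda VehicleType:y Price:35")
            break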

Edit a few lines of uncompressed PDF in Python

I want to edit a few lines in an uncompressed pdf.
I found a similar problem, but since I need to scan the file a few times to get the exact line positions I want to change, that approach doesn't really fit (and the sheer number of regex matches is more than desired).
The pdf contains utf-8 decodable lines (a few of which I want to edit, bookmark target ids in particular)
and a lot of binary blobs (images and so on, I guess).
When I edit the file with Notepad it works fine, but when I do it programmatically (reading it in, changing a few lines, writing it back),
images and some formatting are missing (since they are not read in in the first place, because of the ignore option):
with codecs.open("merged-uncompressed.pdf", "r", encoding='ascii', errors='ignore') as f:
I can read the file in with errors="surrogateescape" and wanted to map the lines from the import above, but I don't know whether this approach can work.
Does anyone know a way how to deal with this?
Best, Lukas
I was able to solve this:
read the file as binary
marked the lines which couldn't be decoded as utf-8
copied the list line by line to a temporary list (non-decodable lines were copied with a placeholder 'None\n')
then went back and did the searching part on the copied list, so I got the lines I wanted to replace
replaced the lines in the original binary list (same indices!)
wrote it back to the file
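A minimal sketch of those steps (illustrative file names and bookmark ids; not the author's actual code):
# Read the file as raw bytes and split it into "lines".
with open("merged-uncompressed.pdf", "rb") as f:
    raw_lines = f.read().split(b"\n")

# Shadow list: decodable lines as text, binary blobs as None placeholders.
shadow = []
for raw in raw_lines:
    try:
        shadow.append(raw.decode("utf-8"))
    except UnicodeDecodeError:
        shadow.append(None)

# Do the searching on the shadow list, then copy changes back by index.
for i, text in enumerate(shadow):
    if text is not None and "OldTargetId" in text:
        raw_lines[i] = text.replace("OldTargetId", "NewTargetId").encode("utf-8")

with open("merged-edited.pdf", "wb") as f:
    f.write(b"\n".join(raw_lines))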
the resulting pdf was a bit corrupted because of whitespace before the target ids of the bookmarks, but recompressing it with qpdf fixed that :)
The code is very messy at the moment and so I don't want to publish it right now.
But I want to add it at github within the next few weeks.
If anyone needs it: just comment and it will have more priority.
Thanks to anyone who wanted to help:)
Lukas

Single Line from file is too big?

In Python, I'm reading a large file, and I want to add each line (after some modifications) to an empty list. I only want to do this for the first few lines, so I did:
X = []
for line in range(3):
    i = file.readline()
    m = str(i)
    X.append(m)
However, an error shows up and says there is a MemoryError for the line
i = file.readline().
What should I do? It is the same even if I make the range 1 (although I don't know how that would change anything, since the readline call is inside the loop).
How do I avoid the error? I'm iterating, and I can't make it into a binary file because the file isn't just integers - there are decimals and non-numerical characters.
The txt file is 5 gigs.
Any ideas?
filehandle.readline() breaks the file into lines at the newline character (\n) - if your file has gigantic lines, or no newlines at all, you'll need to figure out a different way of chunking it.
Normally you might read the file in chunks and process those chunks one by one.
Can you figure out how you might break up the file? Could you, for example, only read 1024 bytes at a time, and work with that chunk?
If not, it's often easier to clean up the format of the file instead of designing a complicated reader.
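For example, a minimal chunked reader might look like this (process() is a hypothetical handler for each chunk):
CHUNK_SIZE = 1024  # bytes per chunk

with open("huge_file.txt") as f:
    while True:
        chunk = f.read(CHUNK_SIZE)
        if not chunk:  # empty string means end of file
            break
        process(chunk)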

Is there a straightforward way to write to a file open in r+ mode without overwriting existing bytes?

I have a text file test.txt, with the following contents:
Thing 1. string
And I'm creating a python file that will increment the number every time it gets run without affecting the rest of the string, like so.
Run once:
Thing 2. string
Run twice:
Thing 3. string
Run three times:
Thing 4. string
Run four times:
Thing 5. string
This is the code that I'm using to accomplish this.
file = open("test.txt","r+")
started = False
beginning = 0 #start of the digits
done = False
num = 0
#building the number from digits
while not done:
    next = file.read(1)
    if ord(next) in range(48, 58):  # ascii values of 0-9
        started = True
        num *= 10
        num += int(next)
    elif started:  # has reached the end of the number
        done = True
    else:  # has not reached the beginning of the number
        beginning += 1
num += 1
file.seek(beginning, 0)
file.write(str(num))
This code works, so long as the number is not 10^n - 1 (9, 99, 999, etc.), because in those cases it writes more bytes than the number previously occupied. As such, it overwrites the characters that follow.
So this brings me to the point. I have a way to write to the file that overwrites previously existing bytes; what I need is a way to write to the file that does not overwrite previously existing bytes. Does such a mechanism exist in Python, and if so, what is it?
I have already tried opening the file using the line file = open("test.txt","a+") instead. When I do that, it always writes to the end, regardless of the seek point.
file = open("test.txt","w+") will not work because I need to keep the contents of the file while altering it, and files opened in any variant of w mode are wiped clean.
I have also thought of solving my problem using a function like this:
#file is assumed to be in r+ mode
def write(string, file, index = -1):
    if index == -1:
        index = file.tell()      # default to inserting at the current position
    file.seek(index, 0)
    remainder = file.read()      # everything after the insert point
    file.seek(index)
    file.write(string + remainder)
But I also want to be able to expand the solution to larger files, and reading and rewriting the rest of the file by itself changes what I'm trying to accomplish from O(1) to O(n). It also seems very non-Pythonic, since it accomplishes the task in a less-than-straightforward way.
It would also make my I/O operations inconsistent: I would have class methods (file.read() and file.write()) to read from the file and write to it replacing old characters, but an external function to insert without replacing.
If I make the code inline, rather than a function, it means I have to write several of the same lines of code every time I try to write without replacing, which is also non-Pythonic.
To reiterate my question, is there a more straightforward way to do this, or am I stuck with the function?
Unfortunately, what you want to do is not possible. This is a limitation at a lower level than Python, in the operating system. Neither the Unix nor the Windows file access API offers any way to insert new bytes in the middle of a file without overwriting the bytes that were already there.
Reading the rest of the file and rewriting it is the usual workaround. Actually, the usual workaround is to rewrite the entire file under a new name and then use rename to move it back to the old name. On Unix, this accomplishes an atomic file update - unless the computer crashes, concurrent readers will see either the new file or the old file, not some hybrid. (Windows, sadly, still does not allow you to rename over a name that already exists, so if you use this strategy you have to delete the old file first, opening an unavoidable race window where the file might appear not to exist at all.)
Yes, this is O(N), and yes, if you use the write-new-file-and-rename strategy it temporarily consumes scratch disk space equal to the size of the file (old or new, whichever is larger). That's just how it is.
I haven't thought about it enough to give you even a sketch of the code, but it should be possible to use context managers to wrap up the write-new-file-and-rename approach tidily.
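As a plain-function sketch of that strategy (using tempfile and os.replace, which needs Python 3.3+; transform is a hypothetical per-line function):
import os
import tempfile

def rewrite(path, transform):
    # Write the new version next to the original, then atomically swap it in.
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    try:
        with os.fdopen(fd, "w") as tmp, open(path) as src:
            for line in src:
                tmp.write(transform(line))
        os.replace(tmp_path, path)  # rename over the old file
    except BaseException:
        os.unlink(tmp_path)
        raise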
No, the disk doesn't work like you think it does.
You have to remember that your file is stored on disk as one contiguous
chunk of data*
Your disk happens to be wound up in a great big spool, a bit like a record,
but if you were to unwind your file, you'd get something that looks like
this:
+------------------------------------------------------------+
| Thing 1. String                                            |
+------------------------------------------------------------+
^ ^                                                        ^ ^
| |                                                        | |
| Start of file                                 End of file |
Start of disk                                    End of disk
As you've discovered, there's no way to simply insert data in the middle.
Generally speaking, that wouldn't be possible at all, without physically
altering your disk. And who wants to do that? Especially when just flipping
the magnetic bits on your disk is so much easier and faster. In order to
do what you want to do, you have to read the bytes that you want to
overwrite, then start writing down your new ones. It might look something
like this:
Open the file
Seek to the point of insert
Read the current byte
Seek backward one byte
Write down the first byte of the new string
Read the next byte
Seek backward one byte
Write down the next byte of the new string
Repeat until all the bytes have been written to disk
close the file
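Translated into Python, that byte-shuffling loop might look something like
the sketch below (illustrative only; in practice, rewriting the whole file
is usually simpler and faster):
from collections import deque

def insert_bytes(path, offset, data):
    # Ripple every byte after `offset` forward by len(data) bytes.
    pending = deque(data)               # bytes still waiting to be written
    with open(path, "r+b") as f:
        f.seek(offset)
        while pending:
            displaced = f.read(1)       # byte about to be overwritten
            f.seek(-len(displaced), 1)  # seek backward over it
            f.write(bytes([pending.popleft()]))
            if displaced:
                pending.append(displaced[0])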
Of course, this might be a little bit on the slow side, due to all the
seeking back & forth in the file. It might be faster to read each line,
and then seek back to the previous location in the file. It should be
relatively straightforward to implement something like this in Python,
but as you've discovered, there are system limitations that Python can't
really overcome.
*Unless the files are fragmented, but we're living in an ideal
world where gravity adheres to 9.8 m/s² and the Earth is a perfect
sphere.

python, seek, tell, read. Reading lines from giant csv file

I have a giant file (1.2GB) of feature vectors saved as a csv file.
In order to go through the lines, I've created a Python class that loads batches of rows from the giant file into memory, one batch at a time.
In order for this class to know where exactly to read in the file to get a batch of batch_size complete rows (let's say batch_size=10,000), the first time it is used on a giant file it goes through the entire file once, registers the offset of each line, and saves these offsets to a helper file, so that later it can do file.seek(starting_offset); batch = file.read(num_bytes) to read the next batch of lines.
First, I implemented the registration of line offsets in this manner:
offset = 0
line_offsets = []
for line in self.fid:
    line_offsets.append(offset)
    offset += len(line)
And it worked lovely with giant_file1.
But then I processed these features and created giant_file2 (with normalized features), with the assistance of this class I made.
And next, when I wanted to read batches of lines from giant_file2, it failed, because the batch strings it read were not in the right place (for instance, reading something like "-00\n15.467e-04,..." instead of "15.467e-04,...\n").
So I tried changing the line offset calculation part to:
offset = 0
line_offsets = []
while True:
    line = self.fid.readline()
    if len(line) <= 0:
        break
    line_offsets.append(offset)
    offset = self.fid.tell()
The main change is that the offset I register is taken from the result of fid.tell() instead of cumulative lengths of lines.
This version worked well with giant_file2, but failed with giant_file1.
The further I investigated, the more it seemed that seek(), tell() and read() are inconsistent with each other.
For instance:
fid = file('giant_file1.csv');
fid.readline();
>>>'0.089,169.039,10.375,-30.838,59.171,-50.867,13.968,1.599,-26.718,0.507,-8.967,-8.736,\n'
fid.tell();
>>>67L
fid.readline();
>>>'15.375,91.43,15.754,-147.691,54.234,54.478,-0.435,32.364,4.64,29.479,4.835,-16.697,\n'
fid.seek(67);
fid.tell();
>>>67L
fid.readline();
>>>'507,-8.967,-8.736,\n'
There is some contradiction here: when I'm positioned (according to fid.tell()) at byte 67, the line read is one thing; the second time fid.tell() reports that I'm positioned at byte 67, the line that is read is different.
I can't trust tell() and seek() to put me in the desired location to read from the beginning of the desired line.
On the other hand, when I use (with giant_file1) the length of strings as reference for seek() I get the correct position:
fid.seek(0);
line = fid.readline();
fid.tell();
>>>87L
len(line);
>>>86
fid.seek(86);
fid.readline();
>>>'15.375,91.43,15.754,-147.691,54.234,54.478,-0.435,32.364,4.64,29.479,4.835,-16.697,\n'
So what is going on?
The only difference between giant_file1 and giant_file2 that I can think of is that in giant_file1 the values are written with a decimal point (e.g. -0.435), while in giant_file2 they are all in scientific notation (e.g. -4.350e-01). I don't think either of them is encoded in unicode (the strings I read with a simple file.read() seem readable; how can I make sure?).
I would very much appreciate your help, with explanations, ideas for the cause, and possible solutions (or workarounds).
Thank you,
Yonatan.
I think you have a newline problem. Check whether giant_file1.csv ends its lines with \n or \r\n. If you open the file in text mode, it will return lines ending with \n only, and throw away the redundant \r. So when you look at the length of the line returned, it will be 1 off from the actual file position (which has consumed not just the \n, but the whole \r\n). These errors accumulate as you read more lines, of course.
The solution is to open the file in binary mode instead. In this mode there is no \r\n -> \n translation, so your tally of line lengths stays consistent with your file.tell() queries.
I hope that solves it for you - as it's an easy fix. :) Good luck with your project and happy coding!
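A sketch of the binary-mode version of the offset registration (no newline translation, so line lengths and tell() agree):
line_offsets = []
offset = 0
with open("giant_file1.csv", "rb") as fid:
    for line in fid:  # lines keep their b"\r\n" or b"\n" endings
        line_offsets.append(offset)
        offset += len(line)

# Later: fid.seek(line_offsets[i]) positions you at the start of line i.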
I had to do something similar in the past and ran into something in the standard library called linecache. You might want to look into that as well.
http://docs.python.org/library/linecache.html
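For example (note that linecache caches the whole file in memory, so it suits repeated random access better than one-off reads of very large files):
import linecache

# Line numbers are 1-based; returns '' if the line does not exist.
line = linecache.getline("giant_file1.csv", 12345)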
