Python : Text Replacement In Large Files

I'm trying to insert text at very specific locations in a text file. This text file can be fairly large (>> 10 GB).
The approach I am currently using to read it:
with open("my_text_file.txt") as f:
    while True:
        result = f.read(set_number_of_bytes)
        x = process_result(result)
        if x:
            replace_some_characters_that_i_just_read_and write_it_back_to_same_file
However, I am unsure as to how to implement
replace_some_characters_that_i_just_read_and write_it_back_to_same_file
Is there some method I can use to determine how far I have read in the current file, so that I can write to the file at that position?
Performance-wise, if I were to use the approach above to write to the original file at specific locations, would there be efficiency issues with having to find the write location before writing?
Or would you recommend creating an entirely different file and appending to it on each iteration of the loop above, then deleting the original file once the operation is complete? Assume space is not a large concern but performance is.

Use the fileinput module, which handles files correctly when replacing data, with the inplace flag set:
import sys
import fileinput
for line in fileinput.input('my_text_file.txt', inplace=True):
    x = process_result(line)
    if x:
        line = line.replace('something', x)
    sys.stdout.write(line)
When you use the inplace flag, the original file is moved to a backup, and anything you write to sys.stdout is written to the original filename (so, as a new file). Make sure you write out all lines, altered or not.
You have to rewrite the complete file whenever your replacement data is not exactly the same number of bytes as the parts that you are replacing.
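For comparison, the "write a different file, then swap it in" approach from the question can be sketched without fileinput. This is only a minimal sketch under stated assumptions: the function name replace_in_large_file and its parameters are hypothetical, the file is assumed to be UTF-8 text, and the replacement is assumed to be a plain substring swap that never spans a line break.

```python
import os
import tempfile


def replace_in_large_file(path, old, new, encoding="utf-8"):
    """Stream `path` line by line into a temp file, then swap it into place.

    Hypothetical helper for illustration; not part of any library.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the original so the final rename
    # stays on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding=encoding, newline="") as dst, \
                open(path, "r", encoding=encoding, newline="") as src:
            for line in src:  # only one line is held in memory at a time
                dst.write(line.replace(old, new))
        os.replace(tmp_path, path)  # rename the new file over the original
    except BaseException:
        os.unlink(tmp_path)  # do not leave a half-written temp file behind
        raise
```

Because the data is written sequentially to a new file, the replacement text is free to be longer or shorter than the text it replaces.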

Related

How to avoid file corruption in Python?

I have a pretty basic question; perhaps I didn't know the right keywords, as I couldn't find a previous answer. I use Python scripts to control and gather information for a smarthome environment. I mostly use text files to store and update information within and between the scripts. However, I frequently run into this one issue whenever the server crashes or loses power: the file contents tend to get corrupted or vanish when the crash happens.
To write file content, I usually use a structure like this:
try:
    with open(savefile, "r") as file:
        lines = file.readlines()
except Exception:
    # File missing or unreadable: start with an empty history
    lines = []
lines.append(str(time.time()) + ";" + str(value) + "\n")
if len(lines) > MAX_READINGS:
    lines = lines[-MAX_READINGS:]
with open(savefile, "w") as file:
    file.writelines(lines)
In case of partial corruption, such as blank lines between the data points, I often use a line-by-line loop that only accepts lines with the correct structure (such as a timestamp at the beginning, as in the example). However, sometimes a file gets corrupted to the point that it only contains spaces or is empty, which makes it useless for the scripts that depend on the data.
The filesystem's integrity remains intact in crashes, so it's probably not a lower level problem. But what's the suggested workaround to minimize the corruption risk?
Should I use the "a" mode to append a new line and have another way to deal with the file lengths (the MAX_READINGS), or should I make a temporary copy which I'd then use to overwrite the original after the writing is done? Or might there be an external library providing the right functionality?
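The "temporary copy that then overwrites the original" idea from the question could look roughly like the sketch below. The function name append_reading and the default cap of 1000 are assumptions for illustration. The sketch writes the new contents to a temp file in the same directory, flushes it to disk, and only then renames it over the original. Whether this fully rules out corruption on power loss still depends on the filesystem and hardware.

```python
import os
import tempfile
import time

MAX_READINGS = 1000  # assumed cap; the original value is not given


def append_reading(savefile, value, max_readings=MAX_READINGS):
    """Append one reading and rewrite the file via a temp file plus rename."""
    try:
        with open(savefile, "r") as f:
            lines = f.readlines()
    except FileNotFoundError:
        lines = []
    lines.append("{};{}\n".format(time.time(), value))
    lines = lines[-max_readings:]  # keep only the newest readings

    directory = os.path.dirname(os.path.abspath(savefile))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.writelines(lines)
            tmp.flush()
            os.fsync(tmp.fileno())  # ask the OS to push the data to disk
        # The original file is replaced only after the new one is complete.
        os.replace(tmp_path, savefile)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

With this pattern a crash in the middle of a write should leave the old file in place, rather than a truncated one, because the file opened with "w" is never the one the other scripts read.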

Search for a word, and modify the whole line in Python text processing

This is my carDatabase.txt
CarID:c01 ModelName:honda VehicleType:city Price:20
CarID:c02 ModelName:honda VehicleType:x Price:30
I want to search for a CarID and modify only that whole line, without disturbing the others.
My current code is here:
# Converting txt data into a string and modify
carsDatabaseFile = open('carsDatabase.txt', 'r')
allDataFromDatabase = [line.split(',') for line in carsDatabaseFile.readlines()]

Note:
Your question has a couple of issues: your sample from carDatabase.txt looks like it is tab-delimited, but your current code splits each line around the ',' character. This also looks like a place where a list comprehension might be hurting you more than it is helping you. Break that up into a for loop if you're trying to add some logic to manipulate a single line.
For working with CSV files, I would highly recommend pandas for general manipulation of data in comma-separated as well as a number of other formats.
That said, if you are truly restricted to built-in packages, or you are treating this as a learning exercise, and your goal is to directly manipulate just one line of that file, what you are looking for is the seek method. You can use it in combination with the tell method (documented alongside seek) to find where you are in the file.
1. Write a for loop to identify which line in the file you are looking for.
2. From there, use the output of tell() to find the specific place in the file you are trying to manipulate.
3. Using the output from the above two steps, set the file pointer to that location with the seek() method (by byte: files are really stored as one-dimensional).
4. Use the write() method to directly update the file at the location you determined above.
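The steps above can be sketched as follows. This is an illustration rather than a drop-in solution: overwrite_line_in_place is a hypothetical helper, the file is opened in binary mode so that tell() and seek() deal in plain byte offsets, and readline() is used instead of a for loop because tell() is disabled while iterating over a text-mode file. Note that write() overwrites bytes and does not insert them, so the sketch refuses replacements of a different length.

```python
def overwrite_line_in_place(path, key, new_line):
    """Overwrite the first line that starts with `key`.

    `new_line` must encode to the same number of bytes as the old line
    (line ending excluded). Returns True if a line was replaced.
    """
    key_b = key.encode("utf-8")
    new_b = new_line.encode("utf-8")
    with open(path, "r+b") as f:
        while True:
            start = f.tell()      # byte offset where the next line begins
            line = f.readline()
            if not line:
                return False      # reached end of file without a match
            if line.startswith(key_b):
                body = line.rstrip(b"\r\n")
                if len(new_b) != len(body):
                    raise ValueError(
                        "replacement must be exactly %d bytes" % len(body))
                f.seek(start)     # jump back to the start of that line
                f.write(new_b)    # overwrite; the line ending is untouched
                return True
```

If the new line can differ in length from the old one, the rest of the file has to be shifted, which in practice means rewriting the file.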

Selectively replacing csv header names

I have been searching for a solution for this and haven't been able to find one. I have a directory of folders which contain multiple, very-large csv files. I'm looping through each csv in each folder in the directory to replace values of certain headers. I need the headers to be consistent (from file to file) in order to run a different script to process all the data properly.
I found this solution that I thought would work: change first line of a file in python.
However, this is not working as expected. My code:
from_file = open(filepath)
# for line in f:
#     if
data = from_file.readline()
# print(data)
# with open(filepath, "w") as f:
print 'DBG: replacing in file', filepath
# s = s.replace(search_pattern, replacement)
for i in range(len(search_pattern)):
    data = re.sub(search_pattern[i], replacement[i], data)
# data = re.sub(search_pattern, replacement, data)
to_file = open(filepath, mode="w")
to_file.write(data)
shutil.copyfileobj(from_file, to_file)
I want to replace the header values in search_pattern with values in replacement without saving or writing to a different file - I want to modify the file. I have also tried
shutil.copyfileobj(from_file, to_file, -1)
As I understand it that should copy the whole file rather than breaking it up in chunks, but it doesn't seem to have an effect on my output. Is it possible that the csv is just too big?
I haven't been able to determine a different way to do this or make this way work. Any help would be greatly appreciated!

The answer you copied from "change first line of a file in python" doesn't work on Windows.
On Linux, you can open a file for reading and writing at the same time. The system ensures that there's no conflict, but behind the scenes, 2 different file objects are being handled. And this method is very unsafe: if the program crashes while reading/writing (power off, disk full), the file has a great chance of being truncated or corrupted.
Anyway, on Windows, you cannot open a file for reading and writing at the same time using 2 handles. It just destroys the contents of the file.
So there are 2 options, which are portable and safe:
Create a new file in the same directory; once the copy is complete, delete the first file and rename the new one.
Like this:
import os
import shutil
filepath = "test.txt"
with open(filepath) as from_file, open(filepath+".new","w") as to_file:
    data = from_file.readline()  # read (and discard) the old first line
    to_file.write("something else\n")  # write the new first line
    shutil.copyfileobj(from_file, to_file)  # copy the rest unchanged
os.remove(filepath)
os.rename(filepath+".new",filepath)
This doesn't take much longer, because the rename operation is instantaneous. Besides, if the program/computer crashes at any point, one of the files (old or new) is valid, so it's safe.
If the patterns have the same length, use read/write mode.
Like this:
filepath = "test.txt"
with open(filepath,"r+") as rw_file:
    data = rw_file.readline()
    data = "h"*(len(data)-1) + "\n"
    rw_file.seek(0)
    rw_file.write(data)
Here we read the first line, replace it with the same number of h characters, rewind the file and write the first line back, overwriting the previous contents and keeping the rest of the lines. This is also safe, and even if the file is huge, it's very fast. The only constraint is that the replacement must be exactly the same size as the original (otherwise you would have remainders of the previous data, or you would overwrite the next line(s), since no data is shifted).
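Applied to the header-renaming case from the question, the first option might be sketched like this. The function name replace_header is hypothetical, and the sketch uses os.replace (available in Python 3.3+) in place of the separate remove and rename calls, since it overwrites the destination in a single step. The ".new" suffix follows the example above.

```python
import os
import re
import shutil


def replace_header(filepath, patterns, replacements):
    """Rewrite only the first line of a file, streaming the rest unchanged."""
    tmp_path = filepath + ".new"
    with open(filepath, "r", newline="") as src, \
            open(tmp_path, "w", newline="") as dst:
        header = src.readline()
        # Apply each pattern/replacement pair to the header line only.
        for pattern, repl in zip(patterns, replacements):
            header = re.sub(pattern, repl, header)
        dst.write(header)
        shutil.copyfileobj(src, dst)  # copies the remaining rows in chunks
    os.replace(tmp_path, filepath)  # swap the new file in for the old one
```

Since copyfileobj works in fixed-size chunks, the size of the CSV should not matter for memory use. The cost is one sequential read and one sequential write of the file.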

Using 'r+' mode to overwrite a line in a file with another line of the same length

I have a file called vegetables:
carrots
apples_
cucumbers
What I want to do is open the file in python, and modify it in-place, without overwriting large portions of the file. Specifically, I want to overwrite apples_ with lettuce, such that the file would look like this:
carrots
lettuce
cucumbers
To do this, I've been told to use 'r+' mode. However, I don't know how to overwrite that line in place. Is that possible? All the solutions I am familiar with involve caching the entire file, and then overwriting the entire file, for a small amendment. Is this really the best option?
Important note: the replacement line is always the same length as the original line.
For context: I'm not really concerned with a file on vegetables. Rather, I have a textfile of about 400 lines to which I need to make revisions roughly every two minutes. I have a script to do this, but I want to do it more efficiently.

An answer that works with your example:
with open("vegetables","r+") as t:
    data = t.read()
    t.seek(data.index("apples_"))
    t.write("lettuce")
Although it might not be worth complicating things like this: it's fine to just read the entire file and then overwrite the entire file. You aren't going to save much by doing something like my example.
NOTE: this only works if the replacement has exactly the same length as the original text you are replacing.
edit1: a (possibly bad) example to replace all matches:
import re
with open("test","r+") as t:
    data = t.read()
    for m in re.finditer("apples_", data):
        t.seek(m.start())
        t.write("lettuce")
edit2: something a little more complex, using a closure so that it can check for multiple words to replace:
import re
def get_find_and_replace(f):
    """f --> a file that is open with r+ mode"""
    data = f.read()
    def find_and_replace(old, new):
        for m in re.finditer(old, data):
            f.seek(m.start())
            f.write(new)
    return find_and_replace
with open("test","r+") as f:
    find_and_replace = get_find_and_replace(f)
    find_and_replace("apples_","lettuce")
    #find_and_replace(...,...)
    #find_and_replace(...,...)
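One caveat with the text-mode examples: data.index() and m.start() are character positions, while seek() on a file works in terms of bytes, so the two only line up when every character before the match is a single byte (plain ASCII, for instance). A byte-oriented variant might look like the sketch below. The name overwrite_all is hypothetical and UTF-8 is an assumed encoding.

```python
import re


def overwrite_all(path, old, new, encoding="utf-8"):
    """Overwrite every occurrence of `old` with `new`, in place.

    Both strings must encode to the same number of bytes.
    Returns the number of occurrences replaced.
    """
    old_b, new_b = old.encode(encoding), new.encode(encoding)
    if len(old_b) != len(new_b):
        raise ValueError("old and new must encode to the same number of bytes")
    count = 0
    with open(path, "r+b") as f:  # binary mode: offsets are byte offsets
        data = f.read()  # fine for a ~400-line file, not for multi-GB input
        for m in re.finditer(re.escape(old_b), data):
            f.seek(m.start())
            f.write(new_b)
            count += 1
    return count
```

The length check up front replaces the "same length" note with an actual error, so a mismatched replacement cannot silently overwrite part of the following line.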

If I understand you correctly, fileinput.input should work, provided the string is not a substring of another:
import fileinput
for line in fileinput.input("in.txt",inplace=True):
    print(line.rstrip().replace("apples_","lettuce"))
With inplace=True, print(line.rstrip().replace("apples_","lettuce")) writes to the file in place; it does not print the line to the console.
You can also check for multiple words to replace in one pass:
old = "apples_"
for line in fileinput.input("in.txt",inplace=True):
    if line.rstrip() == old:
        print(line.rstrip().replace(old,"lettuce"))
    elif ....
    elif....
    else:
        print(line.rstrip())
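When there are several whole-line replacements, the if/elif chain could be swapped for a dictionary lookup. This is just a sketch: replace_whole_lines and the mapping argument are hypothetical names, and only lines that match a key exactly are changed.

```python
import fileinput


def replace_whole_lines(path, mapping):
    """Replace every line that exactly matches a key in `mapping`."""
    with fileinput.input(path, inplace=True) as lines:
        for line in lines:
            stripped = line.rstrip("\n")
            # While inplace=True is active, print() writes into the file.
            print(mapping.get(stripped, stripped))
```

For example, replace_whole_lines("in.txt", {"apples_": "lettuce"}) would rewrite in.txt with every apples_ line changed and all other lines written back unchanged.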

How to change the field separator of a file using Python?

I'm new to Python from the R world, and I'm working on big text files structured in data columns (this is LiDAR data, so generally 60 million+ records).
Is it possible to change the field separator (e.g. from tab-delimited to comma-delimited) of such a big file without having to read the file and do a for loop on the lines?

No.
1. Read the file in.
2. Change the separators for each line.
3. Write each line back.
This is easily doable with just a few lines of Python (not tested but the general approach works):
# Python - it's so readable, the code basically just writes itself ;-)
#
with open('infile') as infile:
    with open('outfile', 'w') as outfile:
        for line in infile:
            fields = line.split('\t')
            outfile.write(','.join(fields))
I'm not familiar with R, but if it has a library function for this it's probably doing exactly the same thing.
Note that this code only reads one line at a time from the file, so the file can be larger than the physical RAM - it's never wholly loaded in.
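One thing the plain split/join does not handle is a field that already contains a comma, which would then look like two fields in the output. If that can occur in the data, the standard csv module could be used instead, as in the sketch below. The function name tabs_to_commas is hypothetical, and the "\n" line terminator is an assumption about the desired output.

```python
import csv


def tabs_to_commas(src_path, dst_path):
    """Convert a tab-delimited file to comma-delimited, quoting as needed."""
    with open(src_path, newline="") as src, \
            open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src, delimiter="\t")
        writer = csv.writer(dst, delimiter=",", lineterminator="\n")
        # The reader yields one row at a time, so memory use stays flat.
        writer.writerows(reader)
```

This still reads and rewrites every line, so it is the same approach with the delimiter handling delegated to the library.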

You can use the Linux tr command to replace any character with any other character, for example: tr '\t' ',' < infile > outfile

Actually, let's say yes: you can do it without an explicit loop, e.g.:
with open('in') as infile:
    with open('out', 'w') as outfile:
        # map() is lazy in Python 3, so writelines() is what consumes it
        outfile.writelines(map(lambda line: ','.join(line.split('\t')), infile))

You can't, but I strongly advise you to look at generators.
The point is that you can write a faster, well-structured program without having to store all the data in memory in order to process it.
For instance:
# "outfile" is a placeholder name for the destination file
with open("bigfile", "r") as infile, open("outfile", "w") as some_other_file:
    j = (i.split("\t") for i in infile)  # lazy: nothing is read yet
    s = (",".join(i) for i in j)  # lazy: joins one line at a time
    # and now the magic happens
    for i in s:
        some_other_file.write(i)
This code uses memory for only a single line at a time.
