I have a list of JSON objects stored as a text file, one JSON object per line (total size is 30 GB), and what I'm trying to do is extract elements from those objects and store them in a new list. Here is my code to do that
print("Extracting fingerprints...")
start = time.time()
for jsonObj in open('ctl_records_sample.jsonlines'):
temp_dict = {}
temp_dict = json.loads(jsonObj)
finger = temp_dict['data']['leaf_cert']['fingerprint']
with open("fingerprints.txt", "w") as f:
f.write(finger+"\n")
finger = ""
end = time.time()
print("Fingerprint extraction finished in" + str(end-start) +"s")
Basically, I'm trying to go line-by-line of the original file and write that line's "fingerprint" to the new text file. However, after letting the code run for several seconds, I open up fingerprints.txt and see that only one fingerprint has been written to the file. Any idea what could be happening?
Your code here is the issue:
with open("fingerprints.txt", "w") as f:
f.write(finger+"\n")
The "w" part will truncate file each time it's opened.
You either want to open the file and keep it open throughout your loop, or check that the file exists and if it does open it with "a" to append.
You're opening the file in each loop iteration, in write mode as per your w parameter passed to the open function. Therefore it's being overwritten from the beginning.
You can solve it for example with two different approaches:
You can move your with statement before the for loop and everything will work, since it will be writing sequentially over the same file (using the same descriptor and pointer into the file).
Open the file in append mode each time, what will append your new written content to the end of the file. To do so, replace your w with an a.
When calling open() with the "w" mode, all the file contents will be deleted. From the Python documentation for the open() function:
'w': open for writing, truncating the file first
I think you are looking to use the "a" mode, which appends new contents to the end of the file:
'a': open for writing, appending to the end of the file if it exists
with open("fingerprints.txt", "a", newline="\n") as f:
f.write(finger)
(You can also drop the +"\n" to the f.write() call by passing the newline="\n" argument to open().)
Related
okay, this may have been talked about before, but I am unable to find it anywhere on stack so here i am.
Basically I am writing a script that will take a .txt document and store every other line (even lines say) and print them into a new text document.
I was able to successfully write my code to scan the text and remove the even numbered lines and put them into a list as independent variables but when i got to add each item of the list to the new text documents, depending on where i do that i get either the first line or the last line but never more than one.
here is what i have
f = open('stuffs.txt', 'r')
i = 1
x = []
for line in f.readlines():
if i % 2 == 0:
x.append(line)
i += 1
I have tested that this successfully takes the proper lines and stores them in list x
i have tried
for m in x:
t = open('stuffs2.txt','w')
t.write(m)
directly after, and it only prints the last line
if i do
for line in f.readlines():
if i % 2 == 0:
t = open('stuffs2.txt','w')
t.write(line)
i += 1
it will print the first line
if i try to add the first solution to the for loop as a nested for loop it will also print the first line. I have no idea why it is not taking each individual item and putting it in the .txt
when i print the list it is in there as it should be.
Did look for a canonical - did not find one...
open('stuffs2.txt','w') - "w" == kill whats there and open new empty file ...
Read the documentation: reading-and-writing-files :
7.2. Reading and Writing Files
open() returns a file object, and is most commonly used with two arguments: open(filename, mode). f = open('workfile', 'w')
The first argument is a string containing the filename.
The second argument is another string
containing a few characters describing the way in which the file will
be used.
mode can be 'r' when the file will only be read, 'w' for only
writing (an existing file with the same name will be erased), and 'a'
opens the file for appending; any data written to the file is
automatically added to the end. 'r+' opens the file for both reading
and writing. The mode argument is optional; 'r' will be assumed if
it’s omitted.
To write every 2nd line more economically:
with open("file.txt") as f, open("target.txt","w") as t:
write = True
for line in f:
if write:
t.write(line)
write = not write
this way you do not need to store all lines in memory.
The with open(...) as name : syntax is also better - it will close your filehandle (which you do not do) even if exceptions arise.
If I want to open a file, unpickle an object inside it, then overwrite it later, is it okay to just use
data = {} #Its a dictionary in my code
file = open("filename","wb")
data = pickle.load(file)
data["foo"] = "bar"
pickle.dump(data,file)
file.close()
Or would I have to use "rb" first and then use "wb" later (using with statements for each) which is what I am doing now. Note that in my program, there is a hashing algorithm in between opening the file and closing it, which is where the dictionary data comes from, and I basically want to be able to only open the file once without having to do two with statements
If you want to read, then write the file, do not use modes involving w at all; all of them truncate the file on opening it.
If the file is known to exist, use mode "rb+", which opens an existing file for both read and write.
Your code only needs to change a tiny bit:
# Open using with statement to ensure prompt/proper closing
with open("filename","rb+") as file:
data = pickle.load(file) # Load from file (moves file pointer to end of file)
data["foo"] = "bar"
file.seek(0) # Move file pointer back to beginning of file
pickle.dump(data, file) # Write new data over beginning of file
file.truncate() # If new dump is smaller, make sure to chop off excess data
You can use wb+ which opens the file for both reading and writing
This question is helpful for understanding the differences between each of pythons read and write conditions, but adding + at the end usually always opens the file for both read and write
Confused by python file mode "w+"
Started Python a week ago and I have some questions to ask about reading and writing to the same files. I've gone through some tutorials online but I am still confused about it. I can understand simple read and write files.
openFile = open("filepath", "r")
readFile = openFile.read()
print readFile
openFile = open("filepath", "a")
appendFile = openFile.write("\nTest 123")
openFile.close()
But, if I try the following I get a bunch of unknown text in the text file I am writing to. Can anyone explain why I am getting such errors and why I cannot use the same openFile object the way shown below.
# I get an error when I use the codes below:
openFile = open("filepath", "r+")
writeFile = openFile.write("Test abc")
readFile = openFile.read()
print readFile
openFile.close()
I will try to clarify my problems. In the example above, openFile is the object used to open file. I have no problems if I want write to it the first time. If I want to use the same openFile to read files or append something to it. It doesn't happen or an error is given. I have to declare the same/different open file object before I can perform another read/write action to the same file.
#I have no problems if I do this:
openFile = open("filepath", "r+")
writeFile = openFile.write("Test abc")
openFile2 = open("filepath", "r+")
readFile = openFile2.read()
print readFile
openFile.close()
I will be grateful if anyone can tell me what I did wrong here or is it just a Pythong thing. I am using Python 2.7. Thanks!
Updated Response:
This seems like a bug specific to Windows - http://bugs.python.org/issue1521491.
Quoting from the workaround explained at http://mail.python.org/pipermail/python-bugs-list/2005-August/029886.html
the effect of mixing reads with writes on a file open for update is
entirely undefined unless a file-positioning operation occurs between
them (for example, a seek()). I can't guess what
you expect to happen, but seems most likely that what you
intend could be obtained reliably by inserting
fp.seek(fp.tell())
between read() and your write().
My original response demonstrates how reading/writing on the same file opened for appending works. It is apparently not true if you are using Windows.
Original Response:
In 'r+' mode, using write method will write the string object to the file based on where the pointer is. In your case, it will append the string "Test abc" to the start of the file. See an example below:
>>> f=open("a","r+")
>>> f.read()
'Test abc\nfasdfafasdfa\nsdfgsd\n'
>>> f.write("foooooooooooooo")
>>> f.close()
>>> f=open("a","r+")
>>> f.read()
'Test abc\nfasdfafasdfa\nsdfgsd\nfoooooooooooooo'
The string "foooooooooooooo" got appended at the end of the file since the pointer was already at the end of the file.
Are you on a system that differentiates between binary and text files? You might want to use 'rb+' as a mode in that case.
Append 'b' to the mode to open the file in binary mode, on systems
that differentiate between binary and text files; on systems that
don’t have this distinction, adding the 'b' has no effect.
http://docs.python.org/2/library/functions.html#open
Every open file has an implicit pointer which indicates where data will be read and written. Normally this defaults to the start of the file, but if you use a mode of a (append) then it defaults to the end of the file. It's also worth noting that the w mode will truncate your file (i.e. delete all the contents) even if you add + to the mode.
Whenever you read or write N characters, the read/write pointer will move forward that amount within the file. I find it helps to think of this like an old cassette tape, if you remember those. So, if you executed the following code:
fd = open("testfile.txt", "w+")
fd.write("This is a test file.\n")
fd.close()
fd = open("testfile.txt", "r+")
print fd.read(4)
fd.write(" IS")
fd.close()
... It should end up printing This and then leaving the file content as This IS a test file.. This is because the initial read(4) returns the first 4 characters of the file, because the pointer is at the start of the file. It leaves the pointer at the space character just after This, so the following write(" IS") overwrites the next three characters with a space (the same as is already there) followed by IS, replacing the existing is.
You can use the seek() method of the file to jump to a specific point. After the example above, if you executed the following:
fd = open("testfile.txt", "r+")
fd.seek(10)
fd.write("TEST")
fd.close()
... Then you'll find that the file now contains This IS a TEST file..
All this applies on Unix systems, and you can test those examples to make sure. However, I've had problems mixing read() and write() on Windows systems. For example, when I execute that first example on my Windows machine then it correctly prints This, but when I check the file afterwards the write() has been completely ignored. However, the second example (using seek()) seems to work fine on Windows.
In summary, if you want to read/write from the middle of a file in Windows I'd suggest always using an explicit seek() instead of relying on the position of the read/write pointer. If you're doing only reads or only writes then it's pretty safe.
One final point - if you're specifying paths on Windows as literal strings, remember to escape your backslashes:
fd = open("C:\\Users\\johndoe\\Desktop\\testfile.txt", "r+")
Or you can use raw strings by putting an r at the start:
fd = open(r"C:\Users\johndoe\Desktop\testfile.txt", "r+")
Or the most portable option is to use os.path.join():
fd = open(os.path.join("C:\\", "Users", "johndoe", "Desktop", "testfile.txt"), "r+")
You can find more information about file IO in the official Python docs.
Reading and Writing happens where the current file pointer is and it advances with each read/write.
In your particular case, writing to the openFile, causes the file-pointer to point to the end of file. Trying to read from the end would result EOF.
You need to reset the file pointer, to point to the beginning of the file before through seek(0) before reading from it
You can read, modify and save to the same file in python but you have actually to replace the whole content in file, and to call before updating file content:
# set the pointer to the beginning of the file in order to rewrite the content
edit_file.seek(0)
I needed a function to go through all subdirectories of folder and edit content of the files based on some criteria, if it helps:
new_file_content = ""
for directories, subdirectories, files in os.walk(folder_path):
for file_name in files:
file_path = os.path.join(directories, file_name)
# open file for reading and writing
with io.open(file_path, "r+", encoding="utf-8") as edit_file:
for current_line in edit_file:
if condition in current_line:
# update current line
current_line = current_line.replace('john', 'jack')
new_file_content += current_line
# set the pointer to the beginning of the file in order to rewrite the content
edit_file.seek(0)
# delete actual file content
edit_file.truncate()
# rewrite updated file content
edit_file.write(new_file_content)
# empties new content in order to set for next iteration
new_file_content = ""
edit_file.close()
I'm new to Python and am struggling to understand why this program
#!/usr/bin/env python
infile = open('/usr/src/scripts/in_file.conf')
outfile = open('/usr/src/scripts/in_file.conf', 'w')
replacements = {'abcd':'ABCD', '1234':'bob'}
for line in infile:
for src, target in replacements.items():
line = line.replace(src, target)
outfile.write(line)
infile.close()
outfile.close()
results in a blank file after script execution.
The original in_file.conf is:
testfile of junk
abcd
******************
1234
*************
Correct me if i'm wrong, but it is my understanding that the script opens the in_file.conf and loads the contents into two temporary files in memory, infile & outfile. the dictionary type variable replacements acts like an array to hold the "to find" and to "replace" string.
It loops over each line then a nested loop goes down the line and loads the variables src and target with the contents of the replacement variable (like an array); then writes the line, until all the lines are written.
Am I way off in my understanding?
The in_file.conf is in the same directory as the script, could it just not finding the in_file.conf and writing a blank file?
I told you i was new to python.
Kind Regards,
Reggie.
The problem is that you're opening the same file in read mode and then in write mode (which truncates the file). You should ideally have a different file for the output, but if you need the output to be in the same file, you can delete the old file and rename the new one afterwards.
Please use different files for infile and outfile. Opening a file in write mode will delete its contents. Because your infile and outfile are the same files, your file contents is deleted and your for loop is never run
Started Python a week ago and I have some questions to ask about reading and writing to the same files. I've gone through some tutorials online but I am still confused about it. I can understand simple read and write files.
openFile = open("filepath", "r")
readFile = openFile.read()
print readFile
openFile = open("filepath", "a")
appendFile = openFile.write("\nTest 123")
openFile.close()
But, if I try the following I get a bunch of unknown text in the text file I am writing to. Can anyone explain why I am getting such errors and why I cannot use the same openFile object the way shown below.
# I get an error when I use the codes below:
openFile = open("filepath", "r+")
writeFile = openFile.write("Test abc")
readFile = openFile.read()
print readFile
openFile.close()
I will try to clarify my problems. In the example above, openFile is the object used to open file. I have no problems if I want write to it the first time. If I want to use the same openFile to read files or append something to it. It doesn't happen or an error is given. I have to declare the same/different open file object before I can perform another read/write action to the same file.
#I have no problems if I do this:
openFile = open("filepath", "r+")
writeFile = openFile.write("Test abc")
openFile2 = open("filepath", "r+")
readFile = openFile2.read()
print readFile
openFile.close()
I will be grateful if anyone can tell me what I did wrong here or is it just a Pythong thing. I am using Python 2.7. Thanks!
Updated Response:
This seems like a bug specific to Windows - http://bugs.python.org/issue1521491.
Quoting from the workaround explained at http://mail.python.org/pipermail/python-bugs-list/2005-August/029886.html
the effect of mixing reads with writes on a file open for update is
entirely undefined unless a file-positioning operation occurs between
them (for example, a seek()). I can't guess what
you expect to happen, but seems most likely that what you
intend could be obtained reliably by inserting
fp.seek(fp.tell())
between read() and your write().
My original response demonstrates how reading/writing on the same file opened for appending works. It is apparently not true if you are using Windows.
Original Response:
In 'r+' mode, using write method will write the string object to the file based on where the pointer is. In your case, it will append the string "Test abc" to the start of the file. See an example below:
>>> f=open("a","r+")
>>> f.read()
'Test abc\nfasdfafasdfa\nsdfgsd\n'
>>> f.write("foooooooooooooo")
>>> f.close()
>>> f=open("a","r+")
>>> f.read()
'Test abc\nfasdfafasdfa\nsdfgsd\nfoooooooooooooo'
The string "foooooooooooooo" got appended at the end of the file since the pointer was already at the end of the file.
Are you on a system that differentiates between binary and text files? You might want to use 'rb+' as a mode in that case.
Append 'b' to the mode to open the file in binary mode, on systems
that differentiate between binary and text files; on systems that
don’t have this distinction, adding the 'b' has no effect.
http://docs.python.org/2/library/functions.html#open
Every open file has an implicit pointer which indicates where data will be read and written. Normally this defaults to the start of the file, but if you use a mode of a (append) then it defaults to the end of the file. It's also worth noting that the w mode will truncate your file (i.e. delete all the contents) even if you add + to the mode.
Whenever you read or write N characters, the read/write pointer will move forward that amount within the file. I find it helps to think of this like an old cassette tape, if you remember those. So, if you executed the following code:
fd = open("testfile.txt", "w+")
fd.write("This is a test file.\n")
fd.close()
fd = open("testfile.txt", "r+")
print fd.read(4)
fd.write(" IS")
fd.close()
... It should end up printing This and then leaving the file content as This IS a test file.. This is because the initial read(4) returns the first 4 characters of the file, because the pointer is at the start of the file. It leaves the pointer at the space character just after This, so the following write(" IS") overwrites the next three characters with a space (the same as is already there) followed by IS, replacing the existing is.
You can use the seek() method of the file to jump to a specific point. After the example above, if you executed the following:
fd = open("testfile.txt", "r+")
fd.seek(10)
fd.write("TEST")
fd.close()
... Then you'll find that the file now contains This IS a TEST file..
All this applies on Unix systems, and you can test those examples to make sure. However, I've had problems mixing read() and write() on Windows systems. For example, when I execute that first example on my Windows machine then it correctly prints This, but when I check the file afterwards the write() has been completely ignored. However, the second example (using seek()) seems to work fine on Windows.
In summary, if you want to read/write from the middle of a file in Windows I'd suggest always using an explicit seek() instead of relying on the position of the read/write pointer. If you're doing only reads or only writes then it's pretty safe.
One final point - if you're specifying paths on Windows as literal strings, remember to escape your backslashes:
fd = open("C:\\Users\\johndoe\\Desktop\\testfile.txt", "r+")
Or you can use raw strings by putting an r at the start:
fd = open(r"C:\Users\johndoe\Desktop\testfile.txt", "r+")
Or the most portable option is to use os.path.join():
fd = open(os.path.join("C:\\", "Users", "johndoe", "Desktop", "testfile.txt"), "r+")
You can find more information about file IO in the official Python docs.
Reading and Writing happens where the current file pointer is and it advances with each read/write.
In your particular case, writing to the openFile, causes the file-pointer to point to the end of file. Trying to read from the end would result EOF.
You need to reset the file pointer, to point to the beginning of the file before through seek(0) before reading from it
You can read, modify and save to the same file in python but you have actually to replace the whole content in file, and to call before updating file content:
# set the pointer to the beginning of the file in order to rewrite the content
edit_file.seek(0)
I needed a function to go through all subdirectories of folder and edit content of the files based on some criteria, if it helps:
new_file_content = ""
for directories, subdirectories, files in os.walk(folder_path):
for file_name in files:
file_path = os.path.join(directories, file_name)
# open file for reading and writing
with io.open(file_path, "r+", encoding="utf-8") as edit_file:
for current_line in edit_file:
if condition in current_line:
# update current line
current_line = current_line.replace('john', 'jack')
new_file_content += current_line
# set the pointer to the beginning of the file in order to rewrite the content
edit_file.seek(0)
# delete actual file content
edit_file.truncate()
# rewrite updated file content
edit_file.write(new_file_content)
# empties new content in order to set for next iteration
new_file_content = ""
edit_file.close()