I have a reasonably long array of events (each one being a row of the array), stored as a numpy.ndarray with pickle.dump. Is there any way to efficiently iterate over a serialized object? The idea is very similar to iterating over a text file:
with open('my_file.txt', 'r') as FILE:
    for line in FILE:
        do_something(line)
I'm just concerned about not loading the entire object into memory at once.
Related
I'm confused as to why this won't write to an outfile.
I extract data from a txt file using np.loadtxt(). When I try to write the values to an existing file, I get the error 'numpy.float64' object is not iterable. I'm not looping over a single float value; rather, I'm looping over each element of the array and writing it to an existing file.
Here's my code:
import numpy as np

mass = np.loadtxt('C:/Users/Luis/Desktop/data.txt', usecols=0)
with open('test', 'w') as outfile:
    for i in range(len(mass)):
        outfile.writelines(mass[i])
Could it be that with open() doesn't work with NumPy arrays?
Thanks
with open() is a context manager used to work with files (it works with any file). The error comes from writelines(), which expects strings, not a numpy.float64, so you have to convert each value first: writelines(str(mass[i])). Also, there is no need for the index variable i running over range(len(mass)); you can iterate over the array directly, just like iterating over the lines of a file. I think this is what you want:
import numpy as np

mass = np.loadtxt('/content/sample_data/st.txt', usecols=0)
with open('test.txt', 'w') as outfile:
    for line in mass:
        outfile.writelines(str(line))
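As a side note, NumPy can also write the column back out directly with np.savetxt, which puts each value on its own line; a minimal sketch, assuming the same mass array:

import numpy as np

mass = np.loadtxt('/content/sample_data/st.txt', usecols=0)
# np.savetxt writes one formatted value per line by default
np.savetxt('test.txt', mass, fmt='%g')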
So I want to write each element of a list on a new line in a binary file using pickle, and I want to be able to access these dictionaries later as well.
import pickle
with open(r'student.dat', 'w+b') as file:
    for i in [{1: 11}, {2: 22}, {3: 33}, {4: 44}]:
        pickle.dump(i, file)
    file.seek(0)
    print(pickle.load(file))
Output:
{1: 11}
Could someone explain why the rest of the elements aren't being dumped, or suggest another way to write each one on a new line?
I'm using Python 3
They're all being dumped, but each dump is separate; to load them all, you need to match them with load calls.
For example, change:

import pickle
with open(r'student.dat', 'w+b') as file:
    for i in [{1: 11}, {2: 22}, {3: 33}, {4: 44}]:
        pickle.dump(i, file)
    file.seek(0)
    print(pickle.load(file))
to:

import pickle
with open(r'student.dat', 'w+b') as file:
    for i in [{1: 11}, {2: 22}, {3: 33}, {4: 44}]:
        pickle.dump(i, file)
    file.seek(0)
    for _ in range(4):
        print(pickle.load(file))
If you don't want to perform multiple loads, pickle them as a single data structure (e.g. the original list of dicts all at once).
In none of these cases are you writing newlines, nor should you be; pickle is a binary protocol, so a newline is just another byte with its own meaning. Injecting newlines into the stream would get in the way of loading the data and risk splitting up pieces of it (if you actually read a line at a time when loading).
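A minimal sketch of that single-dump approach, reusing the sample data from the question: dump the whole list once and get it back with a single load.

import pickle

records = [{1: 11}, {2: 22}, {3: 33}, {4: 44}]
with open(r'student.dat', 'w+b') as file:
    pickle.dump(records, file)      # one dump for the whole list
    file.seek(0)
    print(pickle.load(file))        # one load returns the whole list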
I'm running 64-bit Python 3 on Linux, and I have code that generates lists with about 20,000 elements. A memory error occurred when my code tried to write a list of ~20,000 2D arrays to a binary file via the pickle module, even though it had generated all of these arrays and appended them to the list without a problem. I know this must take up a lot of memory, but the machine I'm using has about 100 GB available (according to free -m). The line with the error:
with open('all_data.data', 'wb') as f:
    pickle.dump(data, f)
>>> MemoryError
where data is my list of ~20,000 numpy arrays. Previously I tried to run this code with about 55,000 elements, but when it was 40% of the way through appending the arrays to the data list, it just printed Killed and stopped. So now I'm trying to break the work into segments, but this time I get a MemoryError. How can I get around this? I was also told that I have access to multiple CPUs, but I have no idea how to take advantage of them (I don't yet understand multiprocessing).
Pickle will serialize all of your data, and it will likely build intermediate representations in memory before writing anything to disk - so if your data already uses about half of the available memory, the dump will blow up.
Since your data is already in a list, an easy workaround is to pickle each array separately, instead of trying to serialize the ~20,000 arrays in a single go:
with open('all_data.data', 'wb') as f:
    for item in data:
        pickle.dump(item, f)
Then, to read it back, just keep unpickling objects from the file and appending them to a list until the file is exhausted:
data = []
with open('all_data.data', 'rb') as f:
    while True:
        try:
            data.append(pickle.load(f))
        except EOFError:
            break
This works because unpickling from a file is quite well behaved: the file pointer stays exactly at the point where a pickled object stored in the file ends, so further reads start at the beginning of the next object.
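If you would rather iterate lazily than rebuild the whole list in memory, the same EOFError pattern can be wrapped in a generator; a minimal sketch, assuming the file was written one dump per array as above, with process() standing in for whatever you do with each item:

import pickle

def iter_pickled(path):
    # Yield objects from a file of back-to-back pickles, one at a time.
    with open(path, 'rb') as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return

for item in iter_pickled('all_data.data'):
    process(item)    # placeholder for your own per-array work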
I want to generate a log file in which I have to print two lists for each of about 50 input files, so approximately 100 lists are reported in the log file. I tried using pickle.dump, but it adds some strange characters at the beginning of each value. Also, it writes each value on a different line, and the enclosing brackets are not shown.
Here is a sample output from a test script.
import pickle
x=[1,2,3,4]
fp=open('log.csv','w')
pickle.dump(x,fp)
fp.close()
output:
I want my log file to report:
list 1 is: [1,2,3,4]
If you want your log file to be readable, you are approaching it the wrong way by using pickle which "implements binary protocols"--i.e. it is unreadable.
To get what you want, replace the line
pickle.dump(x,fp)
with
fp.write('list 1 is: ')
fp.write(str(x))
This requires minimal change in the rest of your code. However, good practice would change your code to a better style.
pickle is for storing objects in a form which you could use to recreate the original object. If all you want to do is create a log message, the builtin __str__ method is sufficient.
x = [1, 2, 3, 4]
with open('log.csv', 'w') as fp:
    print('list 1 is: {}'.format(x), file=fp)
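With either approach the log file then contains a human-readable line rather than pickle's binary bytes, for example:
list 1 is: [1, 2, 3, 4]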
Python's pickle is used to serialize objects, which is basically a way that an object and its hierarchy can be stored on your computer for use later.
If your goal is to write data to a CSV file, and then read that CSV file back and output what is inside it, read below.
Writing To A CSV File
import csv
my_list = [1, 2, 3, 4]
# the with block closes (and therefore saves) the file for you
with open('yourFile.csv', 'w', newline='') as myFile:
    writer = csv.writer(myFile)
    writer.writerow(my_list)
The writerow() function writes each element of an iterable (each element of the list, in your case) to its own column. You can run through each one of your lists and write it to its own row in this way. If you want to write multiple rows at once, check out the writerows() method (see the sketch below).
Your file is saved automatically when the with block ends and the file is closed.
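For example, writerows() can write several lists in one call; a minimal sketch with made-up row data:

import csv

rows = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
with open('yourFile.csv', 'w', newline='') as myFile:
    writer = csv.writer(myFile)
    writer.writerows(rows)    # one CSV row per inner list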
Reading A CSV File
import csv
with open('example.csv', newline='') as File:
    reader = csv.reader(File)
    for row in reader:
        print(row)
This will run through all the rows in your csv file and print each one to the console.
I have a file which I read into memory as a list, then split into several lists based on some rule, say list1, list2, ..., listn. Now I want to get the size of each list, where the size I mean is the file size that list would have when written out to a file. The following is the code I have; the file name is 'temp', and its size on disk is 744 bytes.
from os import stat, remove
from sys import getsizeof
print(stat('temp').st_size)  # we get exactly 744 here
# Now read the file into a list and use the getsizeof() function:
with open('temp', 'r') as f:
    chunks = f.readlines()
print(getsizeof(chunks))  # here I get 240, which is quite different from 744
Since I can't use getsizeof() to directly get the file size on disk, once I get the split lists I have to write each one to a temporary file:
open('tmp', 'w').write("".join(list1))
print(stat('tmp').st_size)  # Here is the value I want.
remove('tmp')
This solution is very slow and requires a lot of writes and reads to disk. Is there a better way to do this? Thanks a lot!
Instead of writing a series of bytes to a file and then looking at the file length[1], you could just check the length of the string that you would have written to the file:
print(len("".join(list1)))
Here, I'm assuming that your list contains byte strings. If it doesn't, you can always encode a byte string from your unicode string:
print(len("".join(list1).encode(your_codec)))
which I think you would need for write to work properly anyway in your original solution.
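As a further option, summing the individual lengths gives the same number without creating the intermediate joined string; a minimal sketch, reusing your_codec from above:

# Same value as len("".join(list1)), without building the joined string
print(sum(len(s) for s in list1))
# And for the on-disk byte size of unicode strings:
print(sum(len(s.encode(your_codec)) for s in list1))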
[1] Your original code could also give flaky (wrong!) results since you never close the file. It's not guaranteed that all the contents of the string will be written to the file when you use os.stat on it, due to buffering.