I have a file test.txt which has an array:
array = [3,5,6,7,9,6,4,3,2,1,3,4,5,6,7,8,5,3,3,44,5,6,6,7]
Now what I want to do is get the contents of the array and perform some calculations with it. But the problem is that when I do open("test.txt") I get the content back as a string. The array is very big, and if I do a loop it might not be efficient. Is there any way to get the content without splitting on , ? Any new ideas?
I recommend that you save the file as json instead, and read it in with the json module. Either that, or make it a .py file, and import it as python. A .txt file that looks like a python assignment is kind of odd.
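For instance, a minimal sketch of the JSON route (test.json is a hypothetical file name, and the list shown is just a stand-in for your real data):
import json

# write the list out once as plain JSON (just the list, no "array = " prefix)
with open("test.json", "w") as f:
    json.dump([3, 5, 6, 7, 9, 6, 4, 3], f)

# read it back as a real Python list
with open("test.json") as f:
    array = json.load(f)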
Does your text file need to look like python syntax? A list of comma separated values would be the usual way to provide data:
1,2,3,4,5
Then you could read/write with the csv module or the numpy functions mentioned above. There's a lot of documentation about how to read csv data in efficiently. Once you have your csv reader object set up, the data could be stored with something like:
data = [list(map(float, row)) for row in csvreader]
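Filling in the missing pieces, a minimal sketch with the csv module (data.csv is a stand-in name for your reformatted file of comma-separated numbers):
import csv

with open("data.csv", newline="") as f:
    csvreader = csv.reader(f)
    data = [list(map(float, row)) for row in csvreader]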
If you want to store a python-like expression in a file, store only the expression (i.e. without array =) and parse it using ast.literal_eval().
However, consider using a different format such as JSON. Depending on the calculations you might also want to consider using a format where you do not need to load all data into memory at once.
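A minimal sketch of the ast.literal_eval approach (assuming test.txt now contains only the bare list literal, e.g. [3,5,6,...]):
import ast

with open("test.txt") as f:
    array = ast.literal_eval(f.read().strip())  # strip() just removes surrounding whitespace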
Must the array be saved as a string? Could you use a pickle file and save it as a Python list?
If not, could you try lazy evaluation? Maybe only process sections of the array as needed.
Possibly, if there are calculations on the entire array that you must always do, it might be a good idea to pre-compute those results and store them in the txt file either in addition to the list or instead of the list.
You could also use numpy to load the data from the file using numpy.genfromtxt or numpy.loadtxt. Both are pretty fast and both have the ability to do the recasting on load. If the array is already loaded though, you can use numpy to convert it to an array of floats, and that is really fast.
import numpy as np
a = np.array(["1", "2", "3", "4"])
a = a.astype(float)
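And for loading straight from the file, a rough sketch (assuming the file has been reduced to the bare comma-separated numbers, without the array = prefix):
import numpy as np

a = np.loadtxt("test.txt", delimiter=",")  # parses straight into a float array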
You could write a parser. They are very straightforward, and much, much faster than regular expressions (please don't use those; not that anyone suggested it).
# open up the file (r = read-only; text mode, so we get strings back)
stream = open("file_full_of_numbers.txt", "r")
prefix = ''  # end of the last chunk
full_number_list = []
# get a chunk of the file at a time
while True:
    # just a small 1k chunk
    buffer = stream.read(1024)
    # no more data is left in the file
    if '' == buffer:
        break
    # delimit this chunk of data by a comma
    split_result = buffer.split(",")
    # prepend the end of the last chunk to the first number
    split_result[0] = prefix + split_result[0]
    # save the end of the buffer (a partial number perhaps) for the next loop
    prefix = split_result[-1]
    # only work with full results, so skip the last one
    numbers = split_result[0:-1]
    # do something with the numbers we got (like save them into the full list)
    full_number_list += numbers
# now full_number_list contains all the numbers in text format
You'll also have to add some logic to use the prefix when the buffer is blank. But I'll leave that code up to you.
OK, so the following methods ARE dangerous. Since they are used to attack systems by injecting code into them, use them at your own risk.
array = eval(open("test.txt", 'r').read().split('=', 1)[1])
exec(open('test.txt').read()) # execfile('test.txt') in Python 2; this is the fastest but most dangerous.
Safer methods.
import ast
array = ast.literal_eval(open("test.txt", 'r').read().split('=', 1)[1].strip())
...
array = [float(value) for value in open('test.txt', 'r').read().split('=', 1)[1].strip(' [\n]').split(',')]
The easiest way to serialize Python objects so you can load them later is to use pickle. That assumes you don't want a human-readable format, since that adds major overhead; otherwise, csv is fast and json is flexible.
import pickle
import random
array = random.sample(range(10**3), 20)
pickle.dump(array, open('test.obj', 'wb'))
loaded_array = pickle.load(open('test.obj', 'rb'))
assert array == loaded_array
pickle does have some overhead, and if you need to serialize large objects you can specify the pickle protocol rather than relying on the default; protocol 0 is a verbose text format, and pickle.HIGHEST_PROTOCOL gives the most compact output: pickle.dump(array, open('test.obj', 'wb'), pickle.HIGHEST_PROTOCOL)
If you are working with large numerical or scientific data sets, then use numpy.tofile/numpy.fromfile or scipy.io.savemat/scipy.io.loadmat; they have little overhead, but again, only if you are already using numpy/scipy.
good luck.
Related
I'm trying to translate working MATLAB code for reading a binary file into Python. Is there an equivalent for the following?
% open the file for reading
fid=fopen (filename,'rb','ieee-le');
% first read the signature
tmp=fread(fid,2,'char');
% read sizes
rows=fread(fid,1,'ushort');
cols=fread(fid,1,'ushort');
There's the struct module to do that, specifically the unpack function, which accepts a buffer; but you'll have to read the required size from the input using struct.calcsize.
import struct
endian = "<" # little endian
with open(filename,'rb') as f:
    tmp = struct.unpack(f"{endian}cc",f.read(struct.calcsize("cc")))
    tmp_int = [int.from_bytes(x,byteorder="little") for x in tmp]
    rows = struct.unpack(f"{endian}H",f.read(struct.calcsize("H")))[0]
    cols = struct.unpack(f"{endian}H",f.read(struct.calcsize("H")))[0]
You might want to use the struct.Struct class for reading the rest of the data in chunks, as it is going to be faster than decoding numbers one at a time.
i.e.:
data = []
reader = struct.Struct(endian + "i"*cols) # i for integer
row_size = reader.size
for row_number in range(rows):
    row = reader.unpack(f.read(row_size))
    data.append(row)
Edit: corrected the answer, and added an example for larger chunks.
Edit 2: okay, one more improvement. Assuming we are reading a 1 GB file of shorts, storing it as Python ints makes no sense and will most likely give an out-of-memory error (or the system will freeze); the proper way to do it is using numpy:
import numpy as np
data = np.fromfile(f,dtype=endian+'H').reshape(cols,rows) # ushorts
This way it'll take up the same space in memory as it did on disk.
I am trying to save variables to a text file after I run a program and read them in a different module, so I can call them back in the original program. The point of that is to make plots with 4 different outcomes of the main program.
My attempt at the code:
#main program
a = array([[0.05562032, 0.05386903, 0.05216994, 0.03045489, 0.03029977,
0.03014554],
[0. , 0.00175129, 0.00345037, 0.15353227, 0.1536874 ,
0.15384163]])
#save paramaters in external file
save_paramaters = open('save.txt','w')
save_paramaters.write(str(a))
save_paramaters.close()
I open the txt file in a Python module and save it as a variable, which I corrected manually (replacing spaces with commas):
#new program
dat = "save.txt"
b = open(dat, "r")
c = array(b.read())
In the main program I now call the variable with
a = array([[0.05562032, 0.05386903, 0.05216994, 0.03045489, 0.03029977,
0.03014554],
[0. , 0.00175129, 0.00345037, 0.15353227, 0.1536874 ,
0.15384163]])
#save paramaters in external file
save_paramaters = open('save.txt','w')
save_paramaters.write(str(a))
save_paramaters.close()
#open the variable
from program import c
from matplotlib.pyplot import figure, plot
#and try to plot it
plot(c[1][:], label ='results2')
plot(c[0][:], label ='results1')
File "/Example.py", line 606, in example
plot(c[1][:], label ='results2') #model
IndexError: too many indices for array
If you want to save an array you can't just save it as text and expect python to figure it out. When you read it, you're reading it as text (as a string) and that's all your program can know.
If you want to save complex objects you have several other options:
You can save text (as you do) but parse it manually when reading it to turn it into an array. This is hard to write without bugs and will get even harder if you have anything more complicated than an array.
You can save it using pickle - while this is a good solution for almost all objects, the file created wouldn't be readable to humans, and that's perhaps not what you want.
A good middle ground is to save objects as JSON - this is a standard for most datatypes and would work beautifully for dicts and lists and tuples (but will fail with more complex objects), and more importantly, it will be readable to humans such as yourself.
Let's say you go with JSON. You save a list like this:
import json
with open('save.txt','w') as f:
json.dump(your_object, f)
As simple as that. To read back the list:
import json
with open('save.txt','r') as f:
your_new_object = json.load(f)
This is fairly simple, isn't it? Notice I used a with statement to open the files to make sure they close properly as well, but that's also simpler to write. Using pickle is fairly similar and even has the same syntax, but objects are saved as bytes and not text (so you have to use 'rb' and 'wb' modes on files to read and write, respectively).
To do the same thing with a numpy array, we can also use numpy.save:
np.save('save', your_numpy_array)
And we read it back (with a npy extension) with numpy.load:
your_array = np.load('save.npy')
In readability terms, the resulting file would be semi-readable (less than JSON, more than pickle).
I have hundreds of thousands of data text files to read. As of now, I'm importing the data from the text files every time I run the code. Perhaps the easy solution would be to simply reformat the data into a file that is faster to read.
Anyway, right now every text file I have looks like:
User: unknown
Title : OE1_CHANNEL1_20181204_103805_01
Sample data
Wavelength OE1_CHANNEL1
185.000000 27.291955
186.000000 27.000877
187.000000 25.792290
188.000000 25.205620
189.000000 24.711882
.
.
.
The code where I read and import the txt files is:
# IMPORT DATA
path = 'T2'
if len(sys.argv) == 2:
    path = sys.argv[1]
files = os.listdir(path)
trans_import = []
for index, item in enumerate(files):
    trans_import.append(np.loadtxt(path+'/'+files[1], dtype=float, skiprows=4, usecols=(0,1)))
The resulting array looks like this in the variable explorer:
{ndarray} = [[185. 27.291955]\n [186. 27.000877]\n ... ]
I'm wondering how I could speed up this part? It takes a little too long as of now just to import ~4k text files. There are 841 lines inside every text file (one spectrum). The output I get with this code is 841 * 2 = 1682. Obviously, it treats the \n as a line...
It would probably be much faster if you had one large file instead of many small ones. This is generally more efficient. Additionally, you might get a speedup from just saving the numpy array directly and loading that .npy file in instead of reading in a large text file. I'm not as sure about the last part though. As always when time is a concern, I would try both of these options and then measure the performance improvement.
If for some reason you really can't just have one large text file / .npy file, you could also probably get a speedup by using, e.g., multiprocessing to have multiple workers reading in the files at the same time. Then you can just concatenate the matrices together at the end.
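A rough sketch of the multiprocessing idea (the directory name and the loadtxt arguments just mirror the code in the question and are assumptions about your data layout):
import os
import numpy as np
from multiprocessing import Pool

def load_one(filepath):
    # each worker parses one spectrum file
    return np.loadtxt(filepath, dtype=float, skiprows=4, usecols=(0, 1))

if __name__ == '__main__':
    path = 'T2'
    files = [os.path.join(path, name) for name in os.listdir(path)]
    with Pool() as pool:
        trans_import = pool.map(load_one, files)
    # cache the result so later runs can skip the text parsing entirely
    np.save('all_spectra.npy', np.array(trans_import))
Later runs can then do np.load('all_spectra.npy') instead of re-reading thousands of text files.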
Not your primary question but since it seems to be an issue - you can rewrite the text files to not have those extra newlines, but I don't think np.loadtxt can ignore them. If you're open to using pandas, though, pandas.read_csv with skip_blank_lines=True should handle that for you. To get a numpy.ndarray from a pandas.DataFrame, just do dataframe.values.
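For example, a sketch of that pandas route (the file path is hypothetical, and the column names and skiprows=4 are assumptions based on the sample file in the question):
import pandas as pd

df = pd.read_csv('T2/some_spectrum.txt',
                 sep=r'\s+',               # whitespace-separated columns
                 skiprows=4,               # skip the header lines
                 header=None,
                 names=['wavelength', 'intensity'],
                 skip_blank_lines=True)
data = df.values  # plain numpy.ndarray, like np.loadtxt would give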
Use pandas.read_csv (with C speed) instead of numpy.loadtxt. This is a very helpful post:
http://akuederle.com/stop-using-numpy-loadtxt
I need to use some matrices in Python programs, like
Q = np.matrix([[1,0,1,1,0],
[0,2,0,1,1],
[1,0,2,0,1],
[1,1,0,1,0],
[0,1,1,0,1]])
and I want to import the matrix (using numpy) from a file, so what should I do to achieve that? What code should I write, and what kind of file should I use (.txt?)? I am quite new to Python; can anyone help me? Thank you in advance.
I'm assuming that you're not only importing the matrices, but also exporting them to files in the first place.
If that's true, there are multiple easy options, with different tradeoffs.
np.save saves the array in a binary format that's only usable by NumPy. But it's very fast, and generates reasonably small files.
np.save('matrix.npy', Q)
Q = np.load('matrix.npy')
np.savetxt saves the array in a text file, using a dialect of CSV (with whitespace separators, by default). It's slower, and generates bigger files, but if you want to be able to read or edit the files (or send them through an ASCII-only channel, like email without attachments), it's the best option.
np.savetxt('matrix.txt', Q)
Q = np.loadtxt('matrix.txt')
np.savetxt can also save the array in a compressed text file. This gives you small files, but they're slower to save and load. They're not directly human-readable, but it's very easy to un-gzip a file, and then you've got a text file you can read and edit. So, sometimes this is worth doing.
np.savetxt('matrix.txt.gz', Q)
Q = np.loadtxt('matrix.txt.gz')
Finally, you can just use standard Python saving and loading mechanisms, like pickle:
import pickle

with open('matrix.pickle', 'wb') as f:
    pickle.dump(Q, f)
with open('matrix.pickle', 'rb') as f:
    Q = pickle.load(f)
This is really only useful if you need to store NumPy arrays together with non-NumPy objects.
If you have to save multiple matrices, instead of saving one per file, you might want to look at savez and savez_compressed. Or, if you need multiple objects, only some of which are NumPy, pickle may be the best option.
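A quick sketch of the savez route (Q and R here are just placeholder matrices, and matrices.npz is a hypothetical file name):
import numpy as np

Q = np.eye(5)
R = np.ones((5, 5))
np.savez('matrices.npz', Q=Q, R=R)   # or np.savez_compressed for smaller files

loaded = np.load('matrices.npz')
Q, R = loaded['Q'], loaded['R']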
I have some json files with 500MB.
If I use the "trivial" json.load() to load its content all at once, it will consume a lot of memory.
Is there a way to read the file partially? If it were a line-delimited text file, I would be able to iterate over the lines. I am looking for an analogy to that.
There was a duplicate to this question that had a better answer. See https://stackoverflow.com/a/10382359/1623645, which suggests ijson.
Update:
I tried it out, and ijson is to JSON what SAX is to XML. For instance, you can do this:
import ijson
for prefix, the_type, value in ijson.parse(open(json_file_name)):
    print(prefix, the_type, value)
where prefix is a dot-separated index in the JSON tree (what happens if your key names have dots in them? I guess that would be bad for Javascript, too...), the_type describes a SAX-like event, one of 'null', 'boolean', 'number', 'string', 'map_key', 'start_map', 'end_map', 'start_array', 'end_array', and value is the value of the object or None if the_type is an event like starting/ending a map/array.
The project has some docstrings, but not enough global documentation. I had to dig into ijson/common.py to find what I was looking for.
So the problem is not that each file is too big, but that there are too many of them, and they seem to be adding up in memory. Python's garbage collector should be fine, unless you are keeping around references you don't need. It's hard to tell exactly what's happening without any further information, but some things you can try:
Modularize your code. Do something like:
for json_file in list_of_files:
    process_file(json_file)
If you write process_file() in such a way that it doesn't rely on any global state, and doesn't change any global state, the garbage collector should be able to do its job.
Deal with each file in a separate process. Instead of parsing all the JSON files at once, write a program that parses just one, and pass each one in from a shell script, or from another python process that calls your script via subprocess.Popen (see the sketch below). This is a little less elegant, but if nothing else works, it will ensure that you're not holding on to stale data from one file to the next.
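A rough sketch of that separate-process variant, using subprocess.run (a convenience wrapper around Popen); parse_one.py is a hypothetical script that processes a single file:
import subprocess
import sys

json_files = sys.argv[1:]
for json_file in json_files:
    # each file is parsed in a fresh interpreter, so its memory is
    # returned to the OS as soon as the child process exits
    subprocess.run([sys.executable, 'parse_one.py', json_file], check=True)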
Hope this helps.
Yes.
You can use jsonstreamer, a SAX-like push parser that I have written, which will allow you to parse arbitrarily sized chunks; you can get it here and check out the README for examples. It's fast because it uses the C yajl library.
It can be done by using ijson. The workings of ijson have been very well explained by Jim Pivarski in the answer above. The code below will read a file and print each JSON object from the list. For example, suppose the file content is as below:
[{"name": "rantidine", "drug": {"type": "tablet", "content_type": "solid"}},
{"name": "nicip", "drug": {"type": "capsule", "content_type": "solid"}}]
You can print every element of the array using the below method
import ijson

def extract_json(filename):
    with open(filename, 'rb') as input_file:
        jsonobj = ijson.items(input_file, 'item')
        jsons = (o for o in jsonobj)
        for j in jsons:
            print(j)
Note: 'item' is the prefix ijson gives to each element of a top-level array.
If you want to access only specific JSON objects based on a condition, you can do it in the following way.
def extract_tabtype(filename):
    with open(filename, 'rb') as input_file:
        objects = ijson.items(input_file, 'item.drug')
        tabtype = (o for o in objects if o['type'] == 'tablet')
        for prop in tabtype:
            print(prop)
This will print only those drug objects whose type is tablet.
On your mention of running out of memory I must question if you're actually managing memory. Are you using the "del" keyword to remove your old object before trying to read a new one? Python should never silently retain something in memory if you remove it.
Update
See the other answers for advice.
Original answer from 2010, now outdated
Short answer: no.
Properly dividing a json file would take intimate knowledge of the json object graph to get right.
However, if you have this knowledge, then you could implement a file-like object that wraps the json file and spits out proper chunks.
For instance, if you know that your json file is a single array of objects, you could create a generator that wraps the json file and returns chunks of the array.
You would have to do some string content parsing to get the chunking of the json file right.
I don't know what generates your json content. If possible, I would consider generating a number of manageable files, instead of one huge file.
Another idea is to try loading it into a document-store database like MongoDB.
It deals with large blobs of JSON well. Although you might run into the same problem loading the JSON - avoid the problem by loading the files one at a time.
If that path works for you, then you can interact with the JSON data via their client and potentially not have to hold the entire blob in memory.
http://www.mongodb.org/
"the garbage collector should free the memory"
Correct.
Since it doesn't, something else is wrong. Generally, the problem with infinite memory growth is global variables.
Remove all global variables.
Make all module-level code into smaller functions.
In addition to @codeape's answer:
I would try writing a custom json parser to help you figure out the structure of the JSON blob you are dealing with. Print out the key names only, etc. Make a hierarchical tree and decide (yourself) how you can chunk it. This way you can do what @codeape suggests - break the file up into smaller chunks, etc.
You can convert the JSON file to a CSV file and then process it line by line:
import ijson
import csv
def convert_json(file_path):
    did_write_headers = False
    headers = []
    row = []
    iterable_json = ijson.parse(open(file_path, 'r'))
    with open(file_path + '.csv', 'w', newline='') as csv_file:
        csv_writer = csv.writer(csv_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        for prefix, event, value in iterable_json:
            if event == 'end_map':
                if not did_write_headers:
                    csv_writer.writerow(headers)
                    did_write_headers = True
                csv_writer.writerow(row)
                row = []
            if event == 'map_key' and not did_write_headers:
                headers.append(value)
            if event == 'string':
                row.append(value)
Simply using json.load() will take a lot of time. Instead (assuming the file holds one JSON object per line), you can load the data line by line, build a dictionary of key/value pairs for each line, collect those dictionaries into one outer dictionary, and convert that into a pandas DataFrame, which will help you with further analysis.
import json
import pandas as pd

def get_data():
    with open('Your_json_file_name', 'r') as f:
        for line in f:
            yield line

data = get_data()
data_dict = {}
for i, line in enumerate(data):
    each = {}
    # k and v are the key and value of each pair in this line's JSON object
    for k, v in json.loads(line).items():
        #print(f'{k}: {v}')
        each[f'{k}'] = f'{v}'
    data_dict[i] = each

Data = pd.DataFrame(data_dict)
# Data gives you the dictionary data as a DataFrame (table format), but it will
# be in transposed form, so finally transpose the DataFrame:
Data_1 = Data.T