Python: large number of dict like objects memory use

I am using csv.DictReader to read some large files into memory to then do some analysis, so all objects from multiple CSV files need to be kept in memory. I need to read them as Dictionary to make analysis easier, and because the CSV files may be altered by adding new columns.
Yes SQL can be used, but I'd rather avoid it if it's not needed.
I'm wondering if there is a better and easier way of doing this. My concern is that I will end up with many dictionary objects that share the same keys, which wastes memory. Using __slots__ was an option, but I will only know the attributes of an object after reading the CSV.
[Edit:] Because I am on a legacy system with "restrictions", using third-party libraries is not possible.

If you are on Python 2.6 or later, collections.namedtuple is what you are asking for.
See http://docs.python.org/library/collections.html#collections.namedtuple
(there is even an example of using it with csv).
EDIT: It requires the field names to be valid as Python identifiers, so perhaps it is not suitable in your case.
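A minimal sketch of the namedtuple approach, assuming Python 2.7 or later, where namedtuple accepts rename=True to work around field names that are not valid identifiers. The sample data and the name Row are made up for illustration.

```python
import csv
import io
from collections import namedtuple

# Hypothetical sample data; in practice this would be an open CSV file.
sample = io.StringIO("id,first name,class\n1,Ada,math\n2,Bob,physics\n")

reader = csv.reader(sample)
header = next(reader)

# rename=True replaces invalid field names ("first name", "class")
# with positional names (_1, _2) instead of raising ValueError.
Row = namedtuple("Row", header, rename=True)

# One lightweight tuple per row; the field names live on the class only.
rows = [Row(*record) for record in reader]
```

Renamed columns are only reachable by position or by their _N name, so a separate mapping from the original header to Row._fields may be needed if the original names matter.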

Have you considered using pandas?
It works very well for tables. Relevant for you are the read_csv function and the DataFrame type.
This is how you would use it:
>>> import pandas
>>> table = pandas.read_csv('a.csv')
>>> table
   a  b     c
0  1  2     a
1  2  4     b
2  5  6  word
>>> table.a
0    1
1    2
2    5
Name: a

Use Python's shelve module. It gives you a dictionary-like object that can be dumped to disk when required and loaded back very easily.
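A small sketch of how shelve could hold the rows, assuming each row is a picklable dict and that string keys such as the row number are acceptable. The path and the sample rows are invented for the example.

```python
import os
import shelve
import tempfile

# Hypothetical location for the shelf; any writable path works.
shelf_path = os.path.join(tempfile.mkdtemp(), "rows")

# Store each row under a string key (shelve keys must be strings).
db = shelve.open(shelf_path)
try:
    db["0"] = {"a": "1", "b": "2"}
    db["1"] = {"a": "5", "b": "6", "c": "word"}
finally:
    db.close()

# Reopen later: a row is unpickled from disk only when it is accessed.
db = shelve.open(shelf_path)
try:
    second_row = db["1"]
finally:
    db.close()
```

This trades memory for disk access, so it suits analyses that touch rows one at a time better than ones that scan everything repeatedly.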

If all the data in one column are the same type, you can use NumPy. NumPy's loadtxt and genfromtxt functions can be used to read a CSV file. And because they return an array, the memory usage is smaller than with dicts.

Possibilities:
(1) Benchmark the csv.DictReader approach and see if it causes a problem. Note that the dicts contain POINTERS to the keys and values; the actual key strings are not copied into each dict.
(2) For each file, use csv.reader: after the first row, build a class dynamically, then instantiate it once per remaining row. Perhaps this is what you had in mind.
(3) Have one fixed class, instantiated once per file, which gives you a list of tuples for the actual data, a tuple that maps column indices to column names, and a dict that maps column names to column indices. Tuples occupy less memory than lists because there is no extra append-space allocated. You can then get and set your data via (row_index, column_index) and (row_index, column_name).
In any case, to get better advice, how about some simple facts and stats: What version of Python? How many files? rows per file? columns per file? total unique keys/column names?
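Option (3) could look roughly like the following sketch. The class name Table and its method names are assumptions made for illustration, not an existing API.

```python
import csv
import io


class Table(object):
    """One instance per file: rows as tuples plus a name-to-index map."""

    def __init__(self, lines):
        reader = csv.reader(lines)
        # Tuple mapping column index -> column name.
        self.names = tuple(next(reader))
        # Dict mapping column name -> column index.
        self.index = dict((name, i) for i, name in enumerate(self.names))
        self.rows = [tuple(record) for record in reader]

    def _col(self, column):
        # Accept either a column index or a column name.
        return column if isinstance(column, int) else self.index[column]

    def get(self, row, column):
        return self.rows[row][self._col(column)]

    def set(self, row, column, value):
        # Tuples are immutable, so setting a cell rebuilds that one row.
        col = self._col(column)
        old = self.rows[row]
        self.rows[row] = old[:col] + (value,) + old[col + 1:]


# Hypothetical sample data standing in for an open CSV file.
table = Table(io.StringIO("a,b,c\n1,2,x\n5,6,word\n"))
table.set(0, "b", "20")
```

The column names are stored once per file instead of once per row, which is where the saving over one dict per row comes from.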

Related

vaex - create a dataframe from a list of lists

In Vaex's docs, I cannot find a way to create a dataframe from a list of lists.
In pandas I would simply do pd.DataFrame([['A',1,3], ['B',2,4]]).
How can this be done in Vaex?
There is no method to read a list of lists in vaex. However, there is vaex.from_arrays(), and it works like this:
vaex.from_arrays(column_name_1=list_of_values_1, column_name_2=list_of_values_2)
If you consider different python data structures, you may want to look at vaex.from_dict() or vaex.from_items()
Since you already have the data in memory, you can use pd.DataFrame(list_of_lists) and then load it into vaex:
vaex.from_pandas(pd.DataFrame(list_of_lists))
You may want to del the list of lists data afterwards, to free up memory.
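Since vaex.from_arrays() expects one sequence per column, the row-oriented list of lists has to be transposed first. A sketch in plain Python follows; the helper rows_to_columns and the column names are made up for this example.

```python
def rows_to_columns(rows, names):
    """Transpose a list of equal-length rows into {column_name: values}."""
    columns = [list(col) for col in zip(*rows)]
    if len(columns) != len(names):
        raise ValueError(
            "expected %d columns, got %d" % (len(names), len(columns))
        )
    return dict(zip(names, columns))


list_of_lists = [["A", 1, 3], ["B", 2, 4]]
# The column names are invented for illustration.
columns = rows_to_columns(list_of_lists, ["label", "x", "y"])
```

The resulting dict could then be handed over as vaex.from_arrays(**columns) or vaex.from_dict(columns). Depending on the vaex version, converting each list to a NumPy array first may be necessary.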

Stepwise creation of a YAML file

I am facing the following problem: I create a big data set (several 10GB) of python objects. I want to create an output file in YAML format containing an entry for each object that contains information about the object saved as a nested dictionary. However, I never hold all data in memory at the same time.
The output data should be stored in a dictionary mapping an object name to the saved values. A simple version would look like this:
object_1:
  value_1: 42
  value_2: 23
object_2:
  value_1: 17
  value_2: 13
[...]
object_a_lot:
  value_1: 47
  value_2: 11
To keep a low memory footprint, I would like to write the entry for each object and immediately delete it after writing. My current approach is as follows:
from yaml import dump
[...] # initialize huge_object_list. Here it is still small
with open("output.yaml", "w") as yaml_file:
    for my_object in huge_object_list:
        my_object.compute() # this blows up the size of the object
        # create a single entry for the top level dict
        object_entry = dump(
            {my_object.name: my_object.get_yaml_data()},
            default_flow_style=False,
        )
        yaml_file.write(object_entry)
        my_object.delete_big_stuff() # delete the memory consuming stuff in the object, keep other information which is needed later
Basically I am writing several dictionaries, but each has only one key, and since the object names are unique this does not blow up. This works, but it feels like a bit of a hack, and I would like to ask if someone knows of a better or more proper way to do this.
Is there a way to write a big dictionary to a YAML file, one entry at a time?
If you want to write out a YAML file in stages, you can do it the way you describe.
If your keys are not guaranteed to be unique, then I would recommend using a sequence (i.e. a list) at the top level (even with one item), instead of a mapping.
This doesn't solve the problem of re-reading the file, as PyYAML will try to read the file as a whole and that is not going to load quickly. Keep in mind that the memory PyYAML requires for loading a file can easily be over 100x (a hundred times) the file size. My ruamel.yaml is somewhat better with respect to memory, but still requires several tens of times the file size in memory.
You can of course cut up a file based on "leading" spaces; a new key (or a dash for an item, in case you use sequences) can easily be found that way. You can also look at storing each key-value pair in its own document within one file, which vastly reduces the overhead during loading if you combine the key-value pairs of the single documents yourself.
In similar situations I stored individual YAML "objects" in different files, using the filenames as keys to the "object" values. This requires an efficient filesystem (e.g. one with tail packing), and what is available depends on the OS you are running.
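Cutting the file up on leading spaces can be sketched without any YAML library. This assumes block-style output as produced above, where every top-level key (or dash) starts in column 0 and nothing else does (no comments or document markers in column 0). The function name is hypothetical.

```python
import io


def iter_top_level_entries(lines):
    """Yield the text of each top-level entry, one at a time."""
    block = []
    for line in lines:
        if not line.strip():
            # Keep blank lines only when they sit inside an entry.
            if block:
                block.append(line)
            continue
        # A non-indented line starts a new top-level entry.
        if not line[0].isspace() and block:
            yield "".join(block)
            block = []
        block.append(line)
    if block:
        yield "".join(block)


# Sample text in the format from the question.
sample = io.StringIO(
    "object_1:\n  value_1: 42\n  value_2: 23\n"
    "object_2:\n  value_1: 17\n  value_2: 13\n"
)
entries = list(iter_top_level_entries(sample))
```

Each yielded string is a small, complete YAML mapping, so it could be passed to yaml.safe_load() on its own and only one entry needs to be in memory at a time.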

Big data File: Read and Create structured file

I have a 20+GB dataset that is structured as follows:
1 3
1 2
2 3
1 4
2 1
3 4
4 2
(Note: the repetition is intentional and there is no inherent order in either column.)
I want to construct a file in the following format:
1: 2, 3, 4
2: 3, 1
3: 4
4: 2
Here is my problem; I have tried writing scripts in both Python and C++ to load in the file, create long strings, and write to a file line-by-line. It seems, however, that neither language is capable of handling the task at hand. Does anyone have any suggestions as to how to tackle this problem? Specifically, is there a particular method/program that is optimal for this? Any help or guided directions would be greatly appreciated.
You can try this using Hadoop. You can run a stand-alone MapReduce program. The mapper outputs the first column as the key and the second column as the value. All outputs with the same key go to one reducer, so you get a key and a list of values with that key. You can run through the list of values and output (key, valueString), which is the final output you desire. You can start with a simple Hadoop tutorial and write the mapper and reducer as I suggested. However, I have not tried to process 20GB of data on a stand-alone Hadoop system. You may try. Hope this helps.
Have you tried using a std::vector of std::vector?
The outer vector represents each row. Each slot in the outer vector is a vector containing all the possible values for each row. This assumes that the row # can be used as an index into the vector.
Otherwise, you can try std::map<unsigned int, std::vector<unsigned int> >, where the key is the row number and the vector contains all values for the row.
A std::list would also work.
Does your program run out of memory?
Edit 1: Handling large data files
You can handle your issue by treating it like a merge sort.
Open a file for each row number.
Append the 2nd column values to the file.
After all data is read, close all files.
Open each file and read the values and print them out, comma separated.
Open output file for each key.
While iterating over lines of source file append values into output files.
Join output files.
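A sketch of the file-based grouping in Python. It is a variation on the steps above rather than a literal implementation: opening one file per key can exceed the operating system's limit on open files, so keys are spread over a fixed number of bucket files instead. It assumes integer keys and that the groups of a single bucket fit in memory; the function name and the n_buckets parameter are invented for the example.

```python
import os
import tempfile
from collections import defaultdict


def group_pairs(src_path, dst_path, n_buckets=16):
    """Group 'key value' lines by key using temporary bucket files."""
    with tempfile.TemporaryDirectory() as tmp:
        paths = [
            os.path.join(tmp, "bucket_%d.txt" % i) for i in range(n_buckets)
        ]
        buckets = [open(path, "w") for path in paths]
        try:
            # Pass 1: append every pair to the bucket file of its key.
            with open(src_path) as src:
                for line in src:
                    parts = line.split()
                    if len(parts) != 2:
                        continue  # skip blank or malformed lines
                    key, value = parts
                    buckets[int(key) % n_buckets].write(
                        "%s %s\n" % (key, value)
                    )
        finally:
            for bucket in buckets:
                bucket.close()

        # Pass 2: group each bucket in memory and join the results.
        with open(dst_path, "w") as dst:
            for path in paths:
                grouped = defaultdict(list)
                with open(path) as bucket:
                    for line in bucket:
                        key, value = line.split()
                        grouped[key].append(value)
                for key in sorted(grouped, key=int):
                    dst.write("%s: %s\n" % (key, ", ".join(grouped[key])))
```

Keys come out sorted within a bucket but not across buckets, and the values of a key keep their input order.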
An interesting thought found also on Stack Overflow
If you want to persist a large dictionary, you are basically looking at a database.
As recommended there, use Python's sqlite3 module to write to a table where the primary key is auto incremented, with a field called "key" (or "left") and a field called "value" (or "right").
Then SELECT the MIN(key) and MAX(key) from the table, and with that information you can SELECT all rows that have the same "key" (or "left") value, in sorted order, and write them to an output file (if the database itself is not a suitable output for you).
I have written this approach on the assumption that you call this problem "big data" because the number of keys does not fit well into memory (otherwise, a simple Python dictionary would be enough). However, IMHO this question is not correctly tagged as "big data": in order to require distributed computations on Hadoop or similar, your input data should be much more than what you can hold on a single hard drive, or your computations should be much more costly than a simple hash table lookup and insertion.
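A minimal sketch of the sqlite3 route. The table and column names (edges, left_id, right_id) are invented, and instead of looping from MIN(key) to MAX(key) it uses one ordered SELECT and groups consecutive rows, which is a simplification of the steps described above.

```python
import sqlite3
from itertools import groupby

# Sample pairs from the question; a real run would stream them from the file.
pairs = [(1, 3), (1, 2), (2, 3), (1, 4), (2, 1), (3, 4), (4, 2)]

# Use a file path instead of ":memory:" for data larger than RAM.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE edges ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "left_id INTEGER, right_id INTEGER)"
)
conn.executemany("INSERT INTO edges (left_id, right_id) VALUES (?, ?)", pairs)
conn.execute("CREATE INDEX idx_left ON edges (left_id)")

# Rows arrive sorted by key, then by insertion order within a key.
cursor = conn.execute(
    "SELECT left_id, right_id FROM edges ORDER BY left_id, id"
)
lines = []
for key, group in groupby(cursor, key=lambda row: row[0]):
    values = ", ".join(str(right) for _, right in group)
    lines.append("%d: %s" % (key, values))
conn.close()
```

For the real dataset each line would be written to the output file as soon as it is built, so only one group is held in memory at a time.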

I have single-element arrays. How do I change them into the elements themselves?

After importing a JSON document into a pandas dataframe using records = pandas.read_json(path), where path was a pre-defined path to the JSON document, I discovered that the contents of certain columns of the resulting dataframe "records" are not simply strings as expected. Instead, each "cell" in such a column is an array containing one single element: the string of interest. This makes selecting rows using boolean indexing difficult. For example, records[records['category']=='Python Books'] in IPython outputs an empty dataframe; had the "cells" contained strings instead of arrays of strings, the output would have been nonempty, containing the rows that correspond to Python books.
I could modify the JSON document, so that "records" reads the strings in properly. But is there a way to modify "records" directly, to somehow strip the single-element arrays into the elements themselves?
Update: After clarification, I believe this might accomplish what you want while limiting it to a single iteration over the data:
nested_column_1 = records["column_name_1"]
nested_column_2 = records["column_name_2"]
clean_column_1 = []
clean_column_2 = []
for i in range(len(records.index)):
    clean_column_1.append(nested_column_1[i][0])
    clean_column_2.append(nested_column_2[i][0])
Then you convert the clean_column lists to Series like you mentioned in your comment. Obviously, you make as many nested_column and clean_column lists as you need, and update them all in the loop.
You could generalize this pretty easily by keeping a record of "problem" columns and using that to create a data structure to manage the nested/clean lists, rather than declaring them explicitly as I did in my example. But I thought this might illustrate the approach more clearly.
Obviously, this assumes that all columns have the same number of elements, which may not be a valid assumption in your case.
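The generalization mentioned above might be sketched like this. A plain dict of lists stands in for the dataframe so the example runs without pandas, and the helper name unwrap_columns is made up. Note that it makes one pass per problem column rather than a single pass over all of them.

```python
def unwrap_columns(columns, problem_columns):
    """Replace single-element lists with their only element, in place."""
    for name in problem_columns:
        columns[name] = [cell[0] for cell in columns[name]]
    return columns


# A plain dict of lists stands in for the dataframe here.
records = {
    "category": [["Python Books"], ["Cooking"], ["Python Books"]],
    "price": [10, 20, 30],
}
unwrap_columns(records, ["category"])

# Comparisons against plain strings now behave as expected.
matches = [
    i for i, category in enumerate(records["category"])
    if category == "Python Books"
]
```

Only the columns listed in problem_columns are touched, so columns that already hold plain values are left alone.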
Original Answer:
Sorry if I'm oversimplifying or misunderstanding the problem, but could you just do something like this?
simplified_list = [element[0] for element in my_array_of_arrays]
Or if you don't need the whole thing at once, just a generator instead:
simplifying_generator = (element[0] for element in my_array_of_arrays)

How to create a memoryview for a non-contiguous memory location?

I have a fragmented structure in memory and I'd like to access it as a contiguous-looking memoryview. Is there an easy way to do this or should I implement my own solution?
For example, consider a file format that consists of records. Each record has a fixed-length header that specifies the length of the content of the record. A higher level logical structure may spread over several records. It would make implementing the higher level structure easier if it could see its own fragmented memory location as a simple contiguous array of bytes.
Update:
It seems that python supports this 'segmented' buffer type internally, at least based on this part of the documentation. But this is only the C API.
Update2:
As far as I see, the referenced C API - called old-style buffers - does what I need, but it's now deprecated and unavailable in newer version of Python (3.X). The new buffer protocol - specified in PEP 3118 - offers a new way to represent buffers. This API is more usable in most of the use cases (among them, use cases where the represented buffer is not contiguous in memory), but does not support this specific one, where a one dimensional array may be laid out completely freely (multiple differently sized chunks) in memory.
First - I am assuming you are just trying to do this in pure Python rather than in a C extension. So I am assuming you have loaded the different records you are interested in into a set of Python objects, and your problem is that you want to see the higher level structure that is spread across these objects, with bits here and there throughout the objects.
So can you not simply load each of the records into a bytearray? You can then use Python slicing to create a new array that has just the data for the high level structure you are interested in. You will then have a single byte array with just the data you are interested in and can print it out or manipulate it in any way that you want to.
So something like:
a = bytearray(b"Hello World")  # put your records into byte arrays like this
b = bytearray(b"Stack Overflow")
# Slice and join arrays to form a new array with just the data from your
# high level entity
complexStructure = bytearray(a[0:6] + b[0:])
print(complexStructure)
Of course you will still need to know where within the records your high level structure is in order to slice the arrays correctly, but you would need to know this anyway.
EDIT:
Note that taking a slice of a list does not copy the data in the list; it just creates a new set of references to the data, so:
>>> a = [1,2,3]
>>> b = a[1:3]
>>> id(a[1])
140268972083088
>>> id(b[0])
140268972083088
However, changes to the list b will not change a, as b is a new list. To have changes automatically applied to the original list, you would need to make a more complicated object that holds the lists of the original records and hides them in such a way that it can decide which list, and which element of that list, to change or view when a user wants to modify or view the complex structure. So something like:
class ComplexStructure():
    def __init__(self):
        self.listofrecords = []  # the original record lists, not copies
    def add_records(self, record):
        self.listofrecords.append(record)
    def get_value(self, position):
        listnum, posinlist = ... # formula to figure out which list and where in
                                 # list element of complex structure is
        return self.listofrecords[listnum][posinlist]
    def set_value(self, position, value):
        listnum, posinlist = ... # formula to figure out which list and where in
                                 # list element of complex structure is
        self.listofrecords[listnum][posinlist] = value
Granted this is not the simple way of doing things you were hoping for but it should do what you need.
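One possible way to fill in the position formula is to keep the cumulative start offset of every record and use bisect to find the record that contains a given position. The class below is a sketch under that assumption; the name SegmentedView is invented, and it only supports single integer indices, not slices.

```python
import bisect


class SegmentedView(object):
    """Index several bytearrays as if they were one contiguous sequence."""

    def __init__(self, segments):
        self.segments = list(segments)
        # starts[i] is the logical offset of the first byte of segment i.
        self.starts = []
        total = 0
        for segment in self.segments:
            self.starts.append(total)
            total += len(segment)
        self.length = total

    def __len__(self):
        return self.length

    def _locate(self, position):
        if not 0 <= position < self.length:
            raise IndexError("position out of range")
        # Last segment whose start offset is <= position.
        listnum = bisect.bisect_right(self.starts, position) - 1
        return listnum, position - self.starts[listnum]

    def __getitem__(self, position):
        listnum, posinlist = self._locate(position)
        return self.segments[listnum][posinlist]

    def __setitem__(self, position, value):
        listnum, posinlist = self._locate(position)
        self.segments[listnum][posinlist] = value


a = bytearray(b"Hello ")
b = bytearray(b"World")
view = SegmentedView([a, b])
view[6] = ord("w")  # writes through to the original bytearray b
```

Because the view stores references to the original bytearrays, no record data is copied, but the offsets are computed once, so the segments must not change length afterwards.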
