vaex - create a dataframe from a list of lists - python

In Vaex's docs, I cannot find a way to create a dataframe from a list of lists.
In pandas I would simply do pd.DataFrame([['A',1,3], ['B',2,4]]).
How can this be done in Vaex?

There is no method to read a list of lists in Vaex; however, there is vaex.from_arrays(), which works like this:
vaex.from_arrays(column_name_1=list_of_values_1, column_name_2=list_of_values_2)
For other Python data structures, you may want to look at vaex.from_dict() or vaex.from_items().
Since you already have the data in memory, you can use pd.DataFrame(list_of_lists) and then load it into vaex:
vaex.from_pandas(pd.DataFrame(list_of_lists))
You may want to del the list of lists data afterwards, to free up memory.
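For a concrete illustration, here is a rough sketch of both routes; the column names (letter, x, y) are made up for the example:
import numpy as np
import pandas as pd
import vaex

list_of_lists = [['A', 1, 3], ['B', 2, 4]]

# Option 1: transpose the rows into columns and hand them to vaex.from_arrays
letters, xs, ys = map(list, zip(*list_of_lists))
df = vaex.from_arrays(letter=np.array(letters), x=np.array(xs), y=np.array(ys))

# Option 2: go through pandas and convert
pdf = pd.DataFrame(list_of_lists, columns=['letter', 'x', 'y'])
df = vaex.from_pandas(pdf)
del pdf  # free the intermediate copy once vaex holds the data
print(df)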

Related

How to view contents of an RDD after using map or split (pyspark)?

I am trying to view the contents of the RDD returned by:
labID = fields.map(lambda x: x.split('ID')(1).trim())
When I do:
print(labID)
What I see is this:
PythonRDD[97] at RDD at PythonRDD.scala:48
I see examples of this in Scala, but I can't find examples with pyspark. Scala examples:
myRDD.collect().foreach(println) or myRDD.take(n).foreach(println)
How would I do this with pyspark?
When you call collect() or take(), you get back a list of elements in the rdd. You can then print these values as you would any normal python list.
Since collect() is expensive and slow, I recommend you first try taking a sample of your data to make sure it looks correct:
labID = fields.map(lambda x: x.split('ID')[1].strip())  # note [1] and .strip(), not (1) and .trim()
labID_sample = labID.take(5) # returns the first 5 elements in the rdd as a list
print(labID_sample) # print the list
If you're satisfied that your results look correct, you can grab the whole thing:
labID_all = labID.collect() # returns all elements in the rdd as a list
print(labID_all)
Be aware that these operations bring the data back into the driver's local memory. If you have a lot of data, this could be very slow or it may fail. In that case, consider writing the RDD to disk using saveAsTextFile().
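Putting it together, here is a small sketch; the SparkContext setup and the file paths are illustrative, not taken from the original question:
from pyspark import SparkContext

sc = SparkContext(appName="lab-id-example")        # assumes a working Spark setup
fields = sc.textFile("hdfs:///data/labs.txt")      # illustrative input path

labID = fields.map(lambda x: x.split('ID')[1].strip())

print(labID.take(5))        # a small, safe sample pulled back to the driver
# print(labID.collect())    # everything, only if it fits in driver memory
labID.saveAsTextFile("hdfs:///data/labID_output")  # large results stay on the cluster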

How do I lazily concatenate "numpy ndarray"-like objects for sequential reading?

I have a list of several large hdf5 files, each with a 4D dataset. I would like to obtain a concatenation of them on the first axis, as in, an array-like object that would be used as if all datasets were concatenated. My final intent is to sequentially read chunks of the data along the same axis (e.g. [0:100,:,:,:], [100:200,:,:,:], ...), multiple times.
Datasets in h5py share a significant part of the numpy array API, which allows me to call numpy.concatenate to get the job done:
files = [h5.File(name, 'r') for name in filenames]
X = np.concatenate([f['data'] for f in files], axis=0)
On the other hand, the memory layout is not the same, and memory cannot be shared among them (related question). Alas, concatenate will eagerly copy the entire content of each array-like object into a new array, which I cannot accept in my use case. The source code of the array concatenation function confirms this.
How can I obtain a concatenated view over multiple array-like objects, without eagerly reading them into memory? Slicing and indexing over this view should behave just as if I had a concatenated array.
I can imagine that writing a custom wrapper would work, but I would like to know whether such an implementation already exists as a library, or whether another solution to the problem is available and just as feasible. My searches so far have yielded nothing of this sort. I am also willing to accept solutions specific to h5py.
flist = [f['data'] for f in files] is a list of dataset objects. The actual data is in the h5 files and is accessible as long as those files remain open.
When you do
arr = np.concatenate(flist, axis=0)
I imagine concatenate first does
temp = [np.asarray(a) for a in flist]
that is, it constructs a list of numpy arrays. I assume np.asarray(f['data']) is the same as f['data'].value or f['data'][:] (as I discussed two years ago in the linked SO question). I should do some time tests comparing that with
arr = np.concatenate([a.value for a in flist], axis=0)
flist is a kind of lazy compilation of these data sets, in that the data still resides in the files and is accessed only when you do something more.
[a.value[:,:,:10] for a in flist]
would load a portion of each of those data sets into memory; I expect that a concatenate on that list would be the equivalent of arr[:,:,:10].
Generators or generator comprehensions are a form of lazy evaluation, but I think they have to be turned into lists before use in concatenate. In any case, the result of concatenate is always an array with all the data in a contiguous block of memory. It is never blocks of data residing in files.
You need to tell us more about what you intend to do with this large concatenated array of data sets. As outlined, I think you can construct arrays that contain slices of all the data sets. You could also perform other actions, as I demonstrate in the previous answer, but with an access-time cost.
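If a wrapper really is needed, here is a minimal sketch of the kind of thing the question imagines: a view that only supports contiguous slices on the first axis and reads from each file on demand. The class name and its limitations are my own, not an existing library API.
import numpy as np
import h5py as h5

class ConcatView(object):
    """Lazy 'concatenated view' over datasets stacked on axis 0.
    Only plain slices on the first axis are handled in this sketch."""
    def __init__(self, datasets):
        self.datasets = datasets
        self.offsets = np.cumsum([0] + [d.shape[0] for d in datasets])  # start of each dataset

    @property
    def shape(self):
        return (int(self.offsets[-1]),) + self.datasets[0].shape[1:]

    def __getitem__(self, key):
        first, rest = (key[0], key[1:]) if isinstance(key, tuple) else (key, ())
        start, stop, step = first.indices(self.shape[0])
        assert step == 1, "only contiguous slices are handled here"
        parts = []
        for d, off in zip(self.datasets, self.offsets):
            lo, hi = max(start - off, 0), min(stop - off, d.shape[0])
            if lo < hi:
                parts.append(d[(slice(lo, hi),) + rest])  # only this chunk is read from disk
        return np.concatenate(parts, axis=0)

files = [h5.File(name, 'r') for name in filenames]  # filenames as in the question
X = ConcatView([f['data'] for f in files])
chunk = X[0:100]   # reads only rows 0..99 from whichever files contain them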

Efficient insertion of row into sorted DataFrame

My problem requires the incremental addition of rows into a sorted DataFrame (with a DateTimeIndex), but I'm currently unable to find an efficient way to do this. There doesn't seem to be any concept of an "insort".
I've tried appending the row and re-sorting in place, and I've also tried getting the insertion point with searchsorted, then slicing and concatenating to create a new DataFrame. Both were "too slow".
Is Pandas just not suited to jobs where you don't have all the data at once and instead get your data incrementally?
Solutions I've tried:
Concatenation
def insert_data(df, data, index):
    insertion_index = df.index.searchsorted(index)
    new_df = pandas.concat([df[:insertion_index], pandas.DataFrame(data, index=[index]), df[insertion_index:]])
    return new_df, insertion_index
Resorting
def insert_data(df, data, index):
    new_df = df.append(pandas.DataFrame(data, index=[index]))
    new_df.sort_index(inplace=True)
    return new_df
pandas is built on numpy, and numpy arrays are fixed-size objects. While there are numpy append and insert functions, in practice they construct new arrays from the old and new data.
There are 2 practical approaches to incrementally defining these arrays:
initialize a large empty array, and fill in values incrementally
incrementally create a Python list (or dictionary), and create the array from the completed list.
Appending to a Python list is a common and fast operation. There is also a list insert, but it is slower. For sorted inserts there is the bisect module in the standard library.
Pandas may have added functions to deal with common creation scenarios. But unless it has coded something special in C it is unlikely to be faster than a more basic Python structure.
Even if you have to use Pandas features at various points along the incremental build, it might be best to create a new DataFrame on the fly from the underlying Python structure.
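As a rough sketch of that "build in Python, convert once" idea (the timestamps and column name below are made up):
import bisect
import pandas as pd

keys = []   # sorted timestamps
rows = []   # row dicts, kept in the same order as keys

def insert_row(timestamp, data):
    i = bisect.bisect_right(keys, timestamp)   # find the insertion point
    keys.insert(i, timestamp)
    rows.insert(i, data)

insert_row(pd.Timestamp('2020-01-02'), {'price': 11.0})
insert_row(pd.Timestamp('2020-01-01'), {'price': 10.5})

df = pd.DataFrame(rows, index=pd.DatetimeIndex(keys))   # one conversion at the end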

I have single-element arrays. How do I change them into the elements themselves?

Importing a JSON document into a pandas dataframe using records = pandas.read_json(path), where path was a pre-defined path to the JSON document, I discovered that the contents of certain columns of the resulting dataframe "records" are not simply strings as expected. Instead, each "cell" in such a column is an array containing one single element: the string of interest. This makes selecting rows using boolean indexing difficult. For example, records[records['category']=='Python Books'] in IPython outputs an empty dataframe; had the "cells" contained strings instead of arrays of strings, the output would have been non-empty, containing rows that correspond to Python books.
I could modify the JSON document, so that "records" reads the strings in properly. But is there a way to modify "records" directly, to somehow strip the single-element arrays into the elements themselves?
Update: After clarification, I believe this might accomplish what you want while limiting it to a single iteration over the data:
nested_column_1 = records["column_name_1"]
nested_column_2 = records["column_name_2"]
clean_column_1 = []
clean_column_2 = []
for i in range(len(records.index)):
    clean_column_1.append(nested_column_1[i][0])
    clean_column_2.append(nested_column_2[i][0])
Then you convert the clean_column lists to Series like you mentioned in your comment. Obviously, you make as many nested_column and clean_column lists as you need, and update them all in the loop.
You could generalize this pretty easily by keeping a record of "problem" columns and using that to create a data structure to manage the nested/clean lists, rather than declaring them explicitly as I did in my example. But I thought this might illustrate the approach more clearly.
Obviously, this assumes that all columns have the same number of elements, which may not be a valid assumption in your case.
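For illustration, here is roughly how the cleaned values can be written back as columns; the column names and sample data are invented, and this variant simply loops per problem column rather than in a single pass:
import pandas as pd

records = pd.DataFrame({'category': [['Python Books'], ['Databases']],
                        'title': [['Learning Python'], ['SQL Basics']]})

for col in ['category', 'title']:     # the "problem" columns
    records[col] = pd.Series([cell[0] for cell in records[col]], index=records.index)

print(records[records['category'] == 'Python Books'])   # now returns the matching row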
Original Answer:
Sorry if I'm oversimplifying or misunderstanding the problem, but could you just do something like this?
simplified_list = [element[0] for element in my_array_of_arrays]
Or, if you don't need the whole thing at once, use a generator instead:
simplifying_generator = (element[0] for element in my_array_of_arrays)

Python: large number of dict-like objects, memory use

I am using csv.DictReader to read some large files into memory and then do some analysis, so all objects from multiple CSV files need to be kept in memory. I need to read them as dictionaries to make analysis easier, and because the CSV files may be altered by adding new columns.
Yes, SQL could be used, but I'd rather avoid it if it's not needed.
I'm wondering if there is a better and easier way of doing this. My concern is that I will have many dictionary objects with the same keys, wasting memory. The use of __slots__ was an option, but I will only know the attributes of an object after reading the CSV.
[Edit:] Due to being on legacy system and "restrictions", use of third party libraries is not possible.
If you are on Python 2.6 or later, collections.namedtuple is what you are asking for.
See http://docs.python.org/library/collections.html#collections.namedtuple
(there is even an example of using it with csv).
EDIT: It requires the field names to be valid as Python identifiers, so perhaps it is not suitable in your case.
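A small sketch of the csv + namedtuple pattern; the file name is illustrative:
import csv
from collections import namedtuple

with open('data.csv', newline='') as f:          # on Python 2, open in 'rb' mode instead
    reader = csv.reader(f)
    Row = namedtuple('Row', next(reader))        # header row supplies the field names
    rows = [Row._make(line) for line in reader]  # one lightweight tuple per data row

print(rows[0])   # fields are accessed as attributes rather than dict lookups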
Have you considered using pandas?
It works very well for tables. Relevant for you are the read_csv function and the DataFrame type.
This is how you would use it:
>>> import pandas
>>> table = pandas.read_csv('a.csv')
>>> table
   a  b     c
0  1  2     a
1  2  4     b
2  5  6  word
>>> table.a
0    1
1    2
2    5
Name: a
Use Python's shelve module. It gives you a dictionary-like object whose contents live on disk and are loaded back very easily when needed.
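A tiny sketch of that idea (file names are illustrative):
import csv
import shelve

db = shelve.open('rows.db')                 # dict-like object backed by a file on disk
with open('a.csv', newline='') as f:
    for i, row in enumerate(csv.DictReader(f)):
        db[str(i)] = row                    # shelve keys must be strings
print(db['0'])                              # pulled back from disk on access
db.close()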
If all the data in one column are the same type, you can use NumPy. NumPy's loadtxt and genfromtxt functions can be used to read a CSV file, and because they return an array, the memory usage is smaller than with dicts.
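For example, a short genfromtxt sketch (the file and column name follow the small 'a.csv' example above):
import numpy as np

data = np.genfromtxt('a.csv', delimiter=',', names=True, dtype=None, encoding='utf-8')
print(data['a'])   # columns of the structured array are accessed by name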
Possibilities:
(1) Benchmark the csv.DictReader approach and see if it causes a problem. Note that the dicts contain POINTERS to the keys and values; the actual key strings are not copied into each dict.
(2) For each file, use csv.reader; after the first (header) row, build a class dynamically and instantiate it once per remaining row. Perhaps this is what you had in mind.
(3) Have one fixed class, instantiated once per file, which gives you a list of tuples for the actual data, a tuple that maps column indices to column names, and a dict that maps column names to column indices. Tuples occupy less memory than lists because there is no extra append-space allocated. You can then get and set your data via (row_index, column_index) and (row_index, column_name).
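A rough sketch of option (3); the class, file name, and column names are illustrative, and only reading is shown (setting a cell would mean rebuilding that row's tuple):
import csv

class Table(object):
    def __init__(self, csv_path):
        with open(csv_path, newline='') as f:
            reader = csv.reader(f)
            self.colnames = tuple(next(reader))                                  # index -> name
            self.colindex = dict((name, i) for i, name in enumerate(self.colnames))  # name -> index
            self.rows = [tuple(row) for row in reader]                           # tuples: no append slack

    def get(self, row_index, column):
        i = column if isinstance(column, int) else self.colindex[column]
        return self.rows[row_index][i]

t = Table('a.csv')
print(t.get(0, 'b'))   # by column name
print(t.get(0, 1))     # by column index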
In any case, to get better advice, how about some simple facts and stats: What version of Python? How many files? rows per file? columns per file? total unique keys/column names?
