High-dimensional data structure in Python

What is the best way to store and analyze high-dimensional data in Python? I like the pandas DataFrame and Panel, where I can easily manipulate the axes. Now I have a hyper-cube (dim >= 4) of data. I have been thinking of things like a dict of Panels, or tuples as panel entries. I wonder if there is a high-dimensional panel-like structure in Python.
update 20/05/16:
Thanks very much for all the answers. I have tried MultiIndex and xarray, but I am not able to comment on any of them. For my problem I will use an ndarray instead, as I found the labels are not essential and I can save them separately.
update 16/09/16:
I ended up using MultiIndex. The ways to manipulate it are pretty tricky at first, but I have kind of gotten used to it now.

MultiIndex is most useful for higher-dimensional data, as explained in the docs and this SO answer, because it allows you to work with any number of dimensions in a DataFrame environment.
In addition to Panel, there is also Panel4D, which is currently experimental. Given the advantages of MultiIndex, I wouldn't recommend using either it or the three-dimensional Panel. I don't think these data structures have gained much traction in comparison, and they will indeed be phased out.

If you need labelled arrays and pandas-like smart indexing, you can use the xarray package, which is essentially an n-dimensional extension of the pandas Panel (Panel is being deprecated in pandas in favour of xarray).
Otherwise, it may sometimes be reasonable to use plain numpy arrays, which can have any number of dimensions; you can also have arbitrarily nested numpy record arrays of any dimension.
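For example, a labelled four-dimensional cube can be built and sliced with xarray roughly like this (a sketch; the dimension names, coordinate values and shape are made up for illustration):
import numpy as np
import xarray as xr

cube = xr.DataArray(
    np.random.rand(2, 3, 4, 5),
    dims=('time', 'x', 'y', 'variable'),
    coords={'time': [2015, 2016]},
)
print(cube.sel(time=2016).mean(dim='x'))  # label-based selection, then reduce one axis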

I recommend continuing to use a DataFrame but utilizing the MultiIndex feature. DataFrame is better supported, and you preserve all of your dimensionality with the MultiIndex.
Example
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'], index=['A', 'B'])
df3 = pd.concat([df for _ in [0, 1]], keys=['one', 'two'])
df4 = pd.concat([df3 for _ in [0, 1]], axis=1, keys=['One', 'Two'])
print(df4)
Looks like:
      One    Two
        a  b   a  b
one A   1  2   1  2
    B   3  4   3  4
two A   1  2   1  2
    B   3  4   3  4
This is a hyper-cube of data, and you'll be much better served sticking with DataFrame: better support, more existing questions and answers, fewer bugs, and many other benefits.
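As a quick illustration (a sketch reusing the df4 built above), slices of the cube can then be pulled out with .loc and pd.IndexSlice:
idx = pd.IndexSlice
print(df4.loc[('one', 'A'), :])                 # a single "row" of the cube
print(df4.loc[idx['one', :], idx['Two', 'b']])  # all rows under 'one', column ('Two', 'b')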

Related

Can I use pd.concat to add new columns equal to other columns in DataFrame?

I am new to Python and am converting SQL to Python and want to learn the most efficient way to process a large dataset (rows > 1 million and columns > 100). I need to create multiple new columns based on other columns in the DataFrame. I have recently learned how to use pd.concat for new boolean columns, but I also have some non-boolean columns that rely on the values of other columns.
In SQL I would use a single case statement (case when age > 1000 then sample_id else 0 end as custom1, etc...). In Python I can achieve the same result in 2 steps (pd.concat + loc find & replace) as shown below. I have seen references in other posts to using the apply method but have also read in other posts that the apply method can be inefficient.
My question is then, for the code shown below, is there a more efficient way to do this? Can I do it all in one step within the pd.concat (so far I haven't been able to get that to work)? I am okay doing it in 2 steps if necessary. I need to be able to handle large integers (100 billion) in my custom1 element and have decimals in my custom2 element.
And finally, I tried using multiple separate np.where statements but received a warning that my DataFrame was fragmented and that I should try to use concat. So I am not sure which approach overall is most efficient or recommended.
Update - after receiving a comment and an answer pointing me towards use of np.where, I decided to test the approaches. Using a data set with 2.7 million rows and 80 columns, I added 25 new columns. First approach was to use the concat + df.loc replace as shown in this post. Second approach was to use np.where. I ran the test 10 times and np.where was faster in all 10 trials. As noted above, I think repeated use of np.where in this way can cause fragmentation, so I suppose now my decision comes down to faster np.where with potential fragmentation vs. slower use of concat without risk of fragmentation. Any further insight on this final update is appreciated.
import pandas as pd

df = pd.DataFrame({'age': [120, 4000],
                   'weight': [505.31, 29.01],
                   'sample_id': [999999999999, 555555555555]},
                  index=['rock1', 'rock2'])
# step 1: efficiently create the starting custom columns using concat
df = pd.concat(
    [
        df,
        (df["age"] > 1000).rename("custom1").astype(int),
        (df["weight"] < 100).rename("custom2").astype(float),
    ],
    axis=1,
)
# step 2: assign final values to the custom columns based on other column values
df.loc[df.custom1 == 1, 'custom1'] = df['sample_id']
df.loc[df.custom2 == 1, 'custom2'] = df['weight'] / 2
Thanks for any feedback you can provide...I appreciate your time helping me.
The standard way to do this is with numpy.where:
import numpy as np
df['custom1'] = np.where(df.age.gt(1000), df.sample_id, 0)
df['custom2'] = np.where(df.weight.lt(100), df.weight / 2, 0)
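Regarding the fragmentation warning mentioned in the update: the np.where results can also be collected into one DataFrame and attached with a single concat, rather than assigned column by column. A sketch reusing the df from the question (dropping the previously created custom columns first so they are not duplicated):
new_cols = pd.DataFrame({
    'custom1': np.where(df.age.gt(1000), df.sample_id, 0),
    'custom2': np.where(df.weight.lt(100), df.weight / 2, 0),
}, index=df.index)
df = pd.concat([df.drop(columns=['custom1', 'custom2'], errors='ignore'), new_cols], axis=1)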

How to transpose a PyArrow.Table object in Python using PyArrow structures only (preferably maintaining contiguous memory ordering)?

I wonder if there's a way to transpose PyArrow tables without e.g. converting them to pandas dataframes or python objects in between.
Right now I'm using something similar to the following example, which I don't think is very efficient (I left out the schema for conciseness):
import numpy as np
import pyarrow as pa
np.random.seed(1234) # For reproducibility
N, M = 3, 4
arrays = [pa.array(np.random.randint(0, 4, N)) for _ in range(M)]
names = [str(x) for x in range(M)]
table = pa.Table.from_arrays(arrays, names)
print("Original:\n", table.to_pandas().values)
transposed = table.from_pandas(table.to_pandas().T)
print("\nTransposed:\n", transposed.to_pandas().values)
Resulting nicely in:
Original:
[[3 1 0 1]
[3 0 1 3]
[2 0 3 1]]
Transposed:
[[3 3 2]
[1 0 0]
[0 1 3]
[1 3 1]]
In the program I'm currently working on, I'm using PyArrow to avoid what seems to be a memory-leak issue I encountered when using pandas DataFrames, whose exact source/cause I couldn't pin down beyond the use of DataFrames being the origin.
Hence, besides efficiency reasons, not wanting to use pandas objects here was the reason to use PyArrow data structures in the first place.
Is there a more direct way to do this?
If so, would the transposed result have contiguous memory blocks if the original table is also contiguous?
Also, would calling transposed.combine_chunks() reorder memory so that this table is contiguous along the columnar axis?
Is there a more direct way to do this?
No. It's not possible today. You're welcome to file a JIRA ticket. I couldn't find one.
The C++ API has array builders which would make this pretty straightforward, but there is no Python support for them at the moment (there is a JIRA for that, https://issues.apache.org/jira/browse/ARROW-3917, but the marshaling overhead would probably become a bottleneck even if it were available).
If so, would the transposed result have contiguous memory blocks if the original table is also contiguous?
Also, would calling transposed.combine_chunks() reorder memory so that this table is contiguous along the columnar axis?
Arrow arrays are always contiguous along the columnar axis. Are you asking if the entire table would be represented as one contiguous memory region? In that case the answer is no. Arrow does not try to represent entire tables as a single contiguous range.
Arrow being a columnar format, it doesn't lend itself well to this kind of workload (tables with uniform types that are more like tensors/matrices).
The same could be said of pandas (to a lesser extent); numpy is better suited for this type of payload. So instead of converting to pandas and transposing, you could convert to numpy and transpose.
It requires a bit more code, because the conversion from Arrow to numpy only works at the array/column level (not at the table level). See the docs.
transposed_matrix = np.array([col.to_numpy() for col in table.itercolumns()]).T
transposed_arrays = [pa.array(col) for col in transposed_matrix]
transposed_names = [str(x) for x in range(len(transposed_arrays))]
transposed_table = pa.Table.from_arrays(transposed_arrays, names=transposed_names)
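A quick sanity check of the result, reusing the variables above; since each column of the new table is built from a single pa.array, each should consist of one chunk, i.e. be contiguous along the columnar axis:
print(transposed_table.to_pandas().values)    # same values as the pandas round-trip above
print(transposed_table.column(0).num_chunks)  # expected: 1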

Efficient conditional validation of large dataset in python

I have a simple/flat dataset that looks like...
columnA columnB columnC
value1a value1b value1c
value2a value2b value2c
...
valueNa valueNb valueNc
Although the structure is simple, it's tens of millions of rows deep and I have 50+ columns.
I need to validate that each value in a row conforms with certain format requirements. Some checks are simple (e.g. isDecimal, isEmpty, isAllowedValue etc) but some involve references to other columns (e.g. does columnC = columnA / columnB) and some involve conditional validations (e.g. if columnC = x, does columnB contain y).
I started off thinking that the most efficient way to validate this data was by applying lambda functions to my dataframe...
df.apply(lambda x: validateCol(x), axis=1)
But it seems like this can't support the full range of conditional validations I need to perform (where specific cell validations need to refer to other cells in other columns).
Is the most efficient way to do this to simply loop through all rows one-by-one and check each cell one-by-one? At the moment, I'm resorting to this but it's taking several minutes to get through the list...
df.columns = ['columnA', 'columnB', 'columnC']
myList = df.T.to_dict().values()  # much faster to iterate over a list
for row in myList:
    # validate(row['columnA'], row['columnB'], row['columnC'])
    pass
Thanks for any thoughts on the most efficient way to do this. At the moment, my solution works, but it feels ugly and slow!
Iterating over rows is very inefficient. You should work on columns directly using vectorized Numpy or Pandas functions. Indeed, Pandas stores data column-major in memory, so row-by-row accesses require data from many different locations to be fetched and aggregated, and vectorization is then not possible, or hard to achieve efficiently (most hardware cannot vectorize such scattered accesses). Moreover, Python loops are generally very slow (if you use the mainstream CPython interpreter).
You can work on columns using Pandas or Numpy directly. Note that Numpy operations are much faster if you work on well-defined native types (floats, small integers and bounded strings, as opposed to Python objects like unbounded strings and large integers). Note also that Pandas already stores its data using Numpy internally.
Here is an example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'columnA':[1, 4, 5, 7, 8], 'columnB':[4, 7, 3, 6, 9], 'columnC':[5, 8, 6, 1, 2]})
a = df['columnA'].to_numpy()
b = df['columnB'].to_numpy()
c = df['columnC'].to_numpy()
# You can use np.isclose for a row-by-row result stored into an array
print(np.allclose(a/b, c))
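The conditional checks from the question can be expressed the same way. Here is a hedged sketch combining several vectorized checks into a single per-row validity mask (the bounds and the "if columnC == 5 then columnB must equal 4" rule are made up for illustration):
ratio_ok = np.isclose(a / b, c)            # does columnC == columnA / columnB, row by row
range_ok = (a >= 0) & (a <= 100)           # a simple bounds check
cond_ok = np.where(c == 5, b == 4, True)   # conditional rule: only constrains rows where columnC == 5
row_valid = ratio_ok & range_ok & cond_ok
print(df[~row_valid])                      # rows failing at least one check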
Another approach is to split your dataset into smaller pieces and parallelize computation.
For the validation part (types or schema), I suggest using a library like voluptuous or similar; I have found a schema a very maintainable approach.
Anyway, as @jerome said, a vectorized approach can save you a lot of computation time.
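For completeness, a minimal sketch of what a schema with voluptuous could look like when applied per record (the column names and rules are illustrative only):
from voluptuous import All, MultipleInvalid, Range, Required, Schema

row_schema = Schema({
    Required('columnA'): All(int, Range(min=0)),
    Required('columnB'): All(int, Range(min=1)),  # non-zero so the ratio check is well defined
    Required('columnC'): int,
})

try:
    row_schema({'columnA': 1, 'columnB': 4, 'columnC': 5})  # passes silently
except MultipleInvalid as exc:
    print(exc)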

Python scatter matrices from dataframe with too many columns

I am new to python and data science, and I am currently working on a project that is based on a very large dataframe, with 75 columns. I am doing some data exploration and I would like to check for possible correlations between the columns. For smaller dataframes I know I could use pandas plotting.scatter_matrix() on the dataframe in order to do so. However, in my case this produces a 75x75 matrix -- and I can't even visualize the individual plots.
An alternative would be creating lists of 5 columns and using scatter_matrix multiple times, but this method would produce too many scatter matrices. For instance, with 15 columns this would be:
import pandas as pd
df = pd.read_csv('dataset.csv')
list1 = list(df.columns[0:5])
list2 = list(df.columns[5:10])
list3 = list(df.columns[10:15])
pd.plotting.scatter_matrix(df[list1])
pd.plotting.scatter_matrix(df[list2])
pd.plotting.scatter_matrix(df[list3])
In order to use this same method with 75 columns, I'd have to go on until list15. This looks very inefficient. I wonder if there would be a better way to explore correlations in my dataset.
The problem here is, to a lesser extent, a technical one. Producing the plots (5625 of them) will take quite a long time, and they will also take up a fair amount of memory.
So I would ask a few questions to get around the problems:
Is it really necessary to have all these scatter plots?
Can I reduce the dimensionality in advance?
Why do I have such a high number of dimensions?
If the plots are really useful, you could produce them yourself in blocks and stick them together (see the sketch below), or wait until the function is done.
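If you do go block by block, the fifteen hand-written lists from the question can be replaced by a loop over column blocks (a sketch; the block size and figure size are arbitrary choices, and like the original approach it only compares columns within the same block):
import pandas as pd

df = pd.read_csv('dataset.csv')  # same file as in the question
block = 5
for start in range(0, len(df.columns), block):
    cols = df.columns[start:start + block]
    pd.plotting.scatter_matrix(df[cols], figsize=(10, 10))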

Pandas: Edit part of dataframe, have it affect main dataframe

EDIT: A suggested possible duplicate (this question) is not a duplicate. I'm asking if a slice of a dataframe can be edited and have that slice affect the original dataframe. The "duplicate" Q/A suggested is just looking for an alternate to .loc. The simple answer to my original question appears to be, "no".
Original Question:
This question likely has a duplicate somewhere, but I couldn't find it. Also, I'm guessing what I'm about to ask isn't possible, but worth a shot.
I'm looking to be able to filter or mask a large dataframe, get a smaller dataframe for ease of coding, edit the smaller dataframe, and have it affect the larger dataframe.
So something like this:
import pandas as pd

df_full = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df_part = df_full[df_full['a'] == 2]
df_part['b'] = 'Kentucky Fried Chicken'
print(df_full)
Would result in:
   a                       b
0  1                       4
1  2  Kentucky Fried Chicken
2  3                       6
I'm well aware of the ability to use the .loc[row_indexer, col_indexer] functionality, but even with a mask variable as the row_indexer, it can be a little unwieldy for more complex purposes.
A little context - I'm loading large database tables into a dataframe and want to make many edits on a small slice of it. So the .loc[] gets tedious. Maybe I could filter out that small slice, edit it, then re-append to the original?
Any thoughts?
Short answer
No. You don't want to play the game where you have to keep checking / guessing whether you are using a copy or a view of a dataframe.
Single update: the right way
The .loc accessor is the way to go. There is nothing unwieldy about it, though it does take some getting used to.
However complex your criteria, if they boil down to Boolean arrays, the .loc accessor is still often the right choice. You would need to show an example where it is genuinely difficult to implement.
df_full = pd.DataFrame({'a':[1,2,3], 'b':[4,5,6]})
df_full.loc[df_full['a'] == 2, 'b'] = 'Kentucky Fried Chicken'
#    a                       b
# 0  1                       4
# 1  2  Kentucky Fried Chicken
# 2  3                       6
Single update: an alternative way
If you find the .loc accessor difficult to work with, one alternative is numpy.where:
import numpy as np

df_full['b'] = np.where(df_full['a'] == 2, 'Kentucky Fried Chicken', df_full['b'])
Multiple updates: for many conditions
pandas.cut, numpy.select or numpy.vectorize can be used to good effect to streamline your code. The usefulness of these will depend on the specific logic you are attempting to apply. The below question includes examples for each of these:
Numpy “where” with multiple conditions
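As a taste of the numpy.select option mentioned above, a hedged sketch with two conditions on the same toy frame (the second replacement value is made up; rows matching no condition keep their original value of 'b'):
conditions = [df_full['a'] == 2, df_full['a'] == 3]
choices = ['Kentucky Fried Chicken', 'Extra Crispy']
df_full['b'] = np.select(conditions, choices, default=df_full['b'])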
