python pandas multiindex to array of matrices

I have a pandas DataFrame - assume, for example, that it contains some rolling covariance estimates:
df = #some data
rolling_covariance = df.rolling(window=100).cov()
If the data has n columns, then rolling_covariance will contain m n-by-n matrices, where m is the number of rows in df.
Is there a quick one-liner to transform rolling_covariance into a numpy array of matrices? You can access individual rows in rolling_covariance using iloc, and you can also iterate through the first level of the MultiIndex and extract the data that way, but I'd like something fast and simple if available.
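One possible approach, sketched under the assumption that df has n columns and that .cov() returns its usual (row, column) MultiIndex: the result then has m*n rows, so a plain reshape recovers the m stacked matrices.
import numpy as np

# reshape the (m*n, n) rolling-covariance frame into m stacked n-by-n matrices;
# the first window-1 matrices are all-NaN because their windows are incomplete
n = df.shape[1]
matrices = rolling_covariance.to_numpy().reshape(-1, n, n)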

Related

Efficient way to create np array based on values in data frame

I have a data frame with N rows containing certain information. Depending on the values in the data frame, I want to create a numpy array with the same number of rows but with M columns.
I have a solution where I iterate through the rows of the data frame and apply a function, which outputs a row of M entries for the array.
However, I am wondering whether there is a smarter, more efficient way that avoids iterating through the df.
Edit:
Apologies, I think the original description was not very clear.
So I have a df with N rows. Depending on the values of certain columns, I want to create M binary entries for each row, which I store in a separate np array.
E.g. the function that I defined can look like this:
def func(row):
    ret = np.zeros(12)
    if row['A'] == 'X':
        ret[3] = 1
    else:
        ret[[3, 6, 9]] = 1
    return ret
And currently I am applying this (simplified) function to each row of the df to get a full (N,M) array, which seems to be a bit inefficient.
See Pandas groupby() to group depending on M and then extract.
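For the simplified func above, a hedged sketch of a vectorized equivalent that replaces the row-wise apply with boolean indexing (assuming M = 12 as in the example):
import numpy as np

# build the whole (N, M) array at once instead of one row at a time
N, M = len(df), 12
out = np.zeros((N, M))
mask = (df['A'] == 'X').to_numpy()
out[mask, 3] = 1                     # rows where A == 'X' set only column 3
out[np.ix_(~mask, [3, 6, 9])] = 1    # all other rows set columns 3, 6 and 9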

adding a column of numpy arrays to an existing Pandas DataFrame

I have a Pandas DataFrame to which I would like to add a new column that I will then populate with numpy arrays, such that each row in that column contains a numpy array. I'm using the following approach and am wondering whether it is correct.
df['embeddings'] = pd.Series(dtype='object')
Then I would iterate over rows and add computed arrays like so (using np.zeros(1024) for illustration only, in reality these are the output of a neural network):
for i in range(df.shape[0]):
    df['embeddings'].loc[i] = np.zeros(1024)
I tested whether it helps to pre-allocate the cells like so, but didn't notice a difference in execution time when I then iterate over rows, at least not with a DataFrame that only has 200 rows:
df['embeddings'] = [np.zeros(1024)] * df.shape[0]
As an alternative to adding a column and then updating its rows, one could create the list of numpy arrays first and then add the list as a new column, but that would require more memory.
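For comparison, a minimal sketch of that build-the-list-first variant; compute_embedding stands in for the neural network here and is purely hypothetical:
import numpy as np

# collect the arrays in a plain Python list, then assign the column in one step
embeddings = [compute_embedding(row) for _, row in df.iterrows()]  # compute_embedding is a hypothetical stand-in
df['embeddings'] = embeddings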

subtract m rows of dataframe from other m rows

I have a dataframe with n rows. All values in the dataframe can be assumed to be integers. I wish to subtract a particular block of m rows from another block of m rows. For example, I wish to do:
df[i:i+m] - df[j:j+m]
This should return a dataframe.
You can use the NumPy representation of your sliced dataframes and feed it into the pd.DataFrame constructor:
res = pd.DataFrame(df.iloc[i:i+m].values - df.iloc[j:j+m].values,
                   columns=df.columns)
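A tiny usage sketch with made-up data, just to show how the pieces fit together:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(10, 2), columns=['a', 'b'])
i, j, m = 0, 5, 3
res = pd.DataFrame(df.iloc[i:i+m].values - df.iloc[j:j+m].values,
                   columns=df.columns)
print(res)  # every entry is -10: rows 0..2 minus rows 5..7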

Dividing a very large dataframe into n random dataframes of size m - Python

I have a dataframe of shape roughly (3 million, 79). I need to make 1,000 dataframes of 3,000 rows each, where each one is a random subset of the dataframe described above, sampled without replacement. That way I get the totality of the data, but divided randomly into 1,000 dataframes.
Once you decide into how many parts n you want to split your dataframe, you can just do:
import pandas as pd
import numpy as np

# sample(frac=1) shuffles all rows; array_split then cuts the shuffled
# frame into n near-equal, non-overlapping pieces
dfs = np.array_split(df.sample(frac=1), n)
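A quick sanity check with the question's numbers (n = 1000), confirming the split is exhaustive and without replacement, assuming the original index is unique:
n = 1000
dfs = np.array_split(df.sample(frac=1), n)
assert sum(len(part) for part in dfs) == len(df)       # nothing lost
assert not pd.concat(dfs).index.duplicated().any()     # no row appears twice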

Slicing Pandas DataFrame with an array of integers specifying location

I have two Pandas DataFrames: one where each column is a cumulative distribution (all entries between [0, 1] and monotonically increasing), and a second with the values associated with each cumulative distribution.
I need to access the values associated with different points in the cumulative distributions (percentiles). For example, I could be interested in the percentiles [.1, .9]. I'm finding the location of these percentiles in the DataFrame with the associated values by checking where in the first DataFrame I should insert the percentiles. This gives me a 2-d numpy array where each column holds the row location for that column.
How can I use this array to access the values in the DataFrame? Is there a better way to access the values in one of the DataFrames based on where the percentile is located in the first DataFrame?
import pandas as pd
import numpy as np
cdfs = pd.DataFrame([[.1,.2],[.4,.3],[.8,.7],[1.0,1.0]])
df1 = pd.DataFrame([[-10.0,-8.0],[1.4,3.3],[5.8,8.7],[11.0,15.0]])
percentiles = [0.15,0.75]
spots = np.apply_along_axis(np.searchsorted,0,cdfs,percentiles)
This does not work:
df1[spots]
Expected output:
[[1.4 -8.0]
[5.8 15.0]]
This does work, but seems cumbersome:
output = pd.DataFrame(index=percentiles, columns=df1.columns)
for column in range(spots.shape[1]):
    output.loc[percentiles, column] = df1.loc[spots[:, column], column].values
try this:
df1.values[spots, [0, 1]]
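The fancy indexing pairs each row index in spots with its column index via broadcasting. A hedged sketch generalizing this to any number of columns and wrapping the result back into a labeled DataFrame:
cols = np.arange(spots.shape[1])
values = df1.to_numpy()[spots, cols]   # broadcasting pairs (row, col) indices
result = pd.DataFrame(values, index=percentiles, columns=df1.columns)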
