Pandas - convert dataframe to square matrix with pairwise combinations from index - python

I am converting a data frame to a square matrix. The data frame has an index and only one column with floats. What I need to do is to calculate all pairs of indices, and for each pair take the mean of the two associated column values. So, the usual pivot function is only part of the solution.
Currently, the function has an estimated complexity of O(n^2), which is not good because I have to work with larger inputs: data frames with several hundred rows at a time. Is there another, faster approach I could take?
Example input (with integers here for simplicity):
df = pd.DataFrame([3, 4, 5])
Update: transformation logic
For an input data frame in the example:
   0
0  3
1  4
2  5
I do the following (not claiming it is the best way though):
get all pairs of indices: (0,1), (1,2), (0,2)
for each pair, compute the mean of their values: (0,1):3.5, (1,2):4.5, (0,2):4.0
build a square symmetric matrix using indices in each pair as column and row identifiers, and zero on the diagonal (as shown in the desired output).
The code is in turn_table_into_square_matrix() below.
Desired output:
     0    1    2
0  0.0  3.5  4.0
1  3.5  0.0  4.5
2  4.0  4.5  0.0
Current implementation:
import pandas as pd
from itertools import combinations
import time
import string
import random
def turn_table_into_square_matrix(original_dataframe):
    # get all pairs of indices
    index_pairs = list(combinations(list(original_dataframe.index), 2))

    rows_for_final_dataframe = []
    # collect the new data frame row by row - the time-consuming part
    for pair in index_pairs:
        subset_original_dataframe = original_dataframe[original_dataframe.index.isin(list(pair))]
        rows_for_final_dataframe.append([pair[0], pair[1], subset_original_dataframe[0].mean()])
        rows_for_final_dataframe.append([pair[1], pair[0], subset_original_dataframe[0].mean()])

    final_dataframe = pd.DataFrame(rows_for_final_dataframe)
    final_dataframe.columns = ["from", "to", "weight"]
    final_dataframe_pivot = final_dataframe.pivot(index="from", columns="to", values="weight")
    final_dataframe_pivot = final_dataframe_pivot.fillna(0)
    return final_dataframe_pivot
Code to time the performance:
for size in range(50, 600, 100):
    index = range(size)
    values = random.sample(range(0, 1000), size)
    example = pd.DataFrame(values, index)
    print("dataframe size", example.shape)

    start_time = time.time()
    turn_table_into_square_matrix(example)
    print("conversion time:", time.time() - start_time)
The timing results:
dataframe size (50, 1)
conversion time: 0.5455281734466553
dataframe size (150, 1)
conversion time: 5.001590013504028
dataframe size (250, 1)
conversion time: 14.562285900115967
dataframe size (350, 1)
conversion time: 31.168692111968994
dataframe size (450, 1)
conversion time: 49.07127499580383
dataframe size (550, 1)
conversion time: 78.73740792274475
Thus, a data frame with 50 rows takes only half a second to convert, whereas one with 550 rows (11 times as many) takes 79 seconds (over 11^2 times as long). Is there a faster solution to this problem?

I don't think it is possible to do better than O(n^2) for that computation. As @piiipmatz suggested, you should try doing everything with numpy and then put the result in a pd.DataFrame. Your problem sounds like a good use case for numpy.add.at.
Here is a quick example
import numpy as np
import itertools
# your original array
x = np.array([1, 4, 8, 99, 77, 23, 4, 45])
n = len(x)
# all pairs of indices in x
a, b = zip(*list(itertools.product(range(n), range(n))))
a, b = np.array(a), np.array(b)
# resulting matrix
result = np.zeros(shape=(n, n))
np.add.at(result, (a, b), (x[a] + x[b]) / 2.0)
print(result)
# [[ 1. 2.5 4.5 50. 39. 12. 2.5 23. ]
# [ 2.5 4. 6. 51.5 40.5 13.5 4. 24.5]
# [ 4.5 6. 8. 53.5 42.5 15.5 6. 26.5]
# [ 50. 51.5 53.5 99. 88. 61. 51.5 72. ]
# [ 39. 40.5 42.5 88. 77. 50. 40.5 61. ]
# [ 12. 13.5 15.5 61. 50. 23. 13.5 34. ]
# [ 2.5 4. 6. 51.5 40.5 13.5 4. 24.5]
# [ 23. 24.5 26.5 72. 61. 34. 24.5 45. ]]
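If you also want to match the desired output in the question (zeros on the diagonal, wrapped back into a data frame), a small follow-up along these lines should do; the DataFrame wrapping is just one option:
import pandas as pd
np.fill_diagonal(result, 0)        # the question wants zeros on the diagonal
df_result = pd.DataFrame(result)   # put the result in a pd.DataFrame if needed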

I think you have a lot of overhead from pandas (e.g. original_dataframe[original_dataframe.index.isin(list(pair))] seems too expensive for what it actually does). I haven't tested it, but I assume you can save a considerable amount of execution time if you just work with numpy arrays. If needed, you can still feed the result into a pandas.DataFrame at the end.
Something like (just to sketch what I mean; assumes import numpy as np and from itertools import combinations):
original_array = original_dataframe.to_numpy().ravel()
n = len(original_array)
pairs = combinations(range(n), 2)  # all pairs of indices
final_matrix = np.zeros((n, n))
for pair in pairs:
    mean = 0.5 * (original_array[pair[0]] + original_array[pair[1]])
    final_matrix[pair[0], pair[1]] = mean
    final_matrix[pair[1], pair[0]] = mean  # mirror for the symmetric entry

How about this:
df.pivot(index='i', columns='j', values='empty')
For this you need to cheat the standard pivot a bit: add the index as two new columns (pivot does not accept the same argument twice) and add an empty column for the values:
df['i'] = df.index
df['j'] = df.index
df['empty'] = None
And that's it.

Related

How to compare if any value is similar to any other using numpy

I have many arrays of coordinate pairs, like so:
a=[(1.001,3),(1.334, 4.2),...,(17.83, 3.4)]
b=[(1.002,3.0001),(1.67, 5.4),...,(17.8299, 3.4)]
c=[(1.00101,3.002),(1.3345, 4.202),...,(18.6, 12.511)]
Any coordinate in any of the pairs can be a duplicate of a coordinate in another array of pairs. The arrays are also not the same size.
The duplicates will vary slightly in their values; for example, I would consider the first values in a, b and c to be duplicates.
I could iterate through each array and compare the values one by one using numpy.isclose; however, that will be slow.
Is there an efficient way to tackle this problem, hopefully using numpy to keep computing times low?
You might want to try the round() function, which will round off the numbers in your lists to the nearest integers.
The next thing I'd suggest might be too extreme: concatenate the arrays, put them into a pandas DataFrame, and call drop_duplicates().
This might not be the solution you want.
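A minimal sketch of that suggestion (using the visible values from the question's arrays; rounding to whole numbers is just one possible tolerance and may well be too coarse for real data):
import pandas as pd

# abbreviated versions of the question's arrays, for illustration
a = [(1.001, 3), (1.334, 4.2), (17.83, 3.4)]
b = [(1.002, 3.0001), (1.67, 5.4), (17.8299, 3.4)]
c = [(1.00101, 3.002), (1.3345, 4.202), (18.6, 12.511)]

# concatenate, round, and drop the (near-)duplicates
frames = [pd.DataFrame(arr, columns=["x", "y"]).round(0) for arr in (a, b, c)]
combined = pd.concat(frames, ignore_index=True)
unique_coords = combined.drop_duplicates()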
You might want to take a look at numpy.testing if you allow for AssertionError handling.
import numpy as np
from numpy import testing as ts

a = np.array((1.001, 3))
b = np.array((1.000101, 3.002))
ts.assert_array_almost_equal(a, b, decimal=1) # output None
but
ts.assert_array_almost_equal(a, b, decimal=3)
results in
AssertionError:
Arrays are not almost equal to 3 decimals
Mismatch: 50%
Max absolute difference: 0.002
Max relative difference: 0.00089891
x: array([1.001, 3. ])
y: array([1. , 3.002])
There are some more interesting functions from numpy.testing. Make sure to take a look at the docs.
I'm using pandas to give you an intuitive result, rather than just numbers. Of course you can expand the solution to your needs.
Say you create a pd.DataFrame from each array, and tag each one with the array it belongs to. I am rounding the results to 2 decimal places; you may use whatever tolerance you want:
dfa = pd.DataFrame(a).round(2)
dfa['arr'] = 'a'
Then, by concatenating, using duplicated and sorting, you get an intuitive DataFrame that might fulfill your needs:
df = pd.concat([dfa, dfb, dfc])
df[df.duplicated(subset=[0,1], keep=False)].sort_values(by=[0,1])
yields
x y arr
0 1.00 3.0 a
0 1.00 3.0 b
0 1.00 3.0 c
1 1.33 4.2 a
1 1.33 4.2 c
2 17.83 3.4 a
2 17.83 3.4 b
The indexes are duplicated, so you can simply use reset_index() at the end and use the newly-generated column as a parameter that indicates the corresponding index on each array. I.e.:
index x y arr
0 0 1.00 3.0 a
1 0 1.00 3.0 b
2 0 1.00 3.0 c
3 1 1.33 4.2 a
4 1 1.33 4.2 c
5 2 17.83 3.4 a
6 2 17.83 3.4 b
So, for example, line 0 indicates a duplicate coordinate, found on index 0 of arr a. Line 1 also indicates a dupe coordinate, found on index 0 of arr b, etc.
Now, if you just want to delete the duplicates and get one final array with only non-duplicate values, you may use drop_duplicates:
df.drop_duplicates(subset=[0,1])[[0,1]].to_numpy()
which yields
array([[ 1. , 3. ],
[ 1.33, 4.2 ],
[17.83, 3.4 ],
[ 1.67, 5.4 ],
[18.6 , 12.51]])

Is it possible to use pandas.DataFrame.rolling with a step greater than 1?

In R you can compute a rolling mean with a specified window that shifts by a specified amount each time.
Maybe I just haven't found it, but it doesn't seem like you can do this in pandas or any other Python library.
Does anyone know of a way around this? I'll give you an example of what I mean:
Here we have bi-weekly data, and I am computing the two-month moving average that shifts by 1 month, which is 2 rows.
So in R I would do something like: two_month__movavg = rollapply(mydata, 4, mean, by = 2, na.pad = FALSE)
Is there no equivalent in Python?
EDIT1:
DATE A DEMAND ... AA DEMAND A Price
0 2006/01/01 00:30:00 8013.27833 ... 5657.67500 20.03
1 2006/01/01 01:00:00 7726.89167 ... 5460.39500 18.66
2 2006/01/01 01:30:00 7372.85833 ... 5766.02500 20.38
3 2006/01/01 02:00:00 7071.83333 ... 5503.25167 18.59
4 2006/01/01 02:30:00 6865.44000 ... 5214.01500 17.53
So, I know it has been a long time since the question was asked, but I bumped into this same problem, and when dealing with long time series you really want to avoid unnecessary calculation of the values you are not interested in. Since the pandas rolling method does not implement a step argument, I wrote a workaround using numpy.
It is basically a combination of the solution in this link and the indexing proposed by BENY.
def apply_rolling_data(data, col, function, window, step=1, labels=None):
    """Perform a rolling window analysis at the column `col` from `data`

    Given a dataframe `data` with time series, call `function` at
    sections of length `window` at the data of column `col`. Append
    the results to `data` as new columns named after `labels`.

    Parameters
    ----------
    data : DataFrame
        Data to be analyzed; the dataframe must store time series
        columnwise, i.e., each column represents a time series and each
        row a time index
    col : str
        Name of the column from `data` to be analyzed
    function : callable
        Function to be called to calculate the rolling window
        analysis; the function must receive as input an array or
        pandas series. Its output must be either a number or a pandas
        series
    window : int
        length of the window to perform the analysis
    step : int
        step to take between two consecutive windows
    labels : list of str
        Names of the output columns; if None they default to
        'metric_i'. If `function` outputs a Series, each index of the
        series is going to be used as the name of its respective
        column in the output

    Returns
    -------
    data : DataFrame
        Input dataframe with added columns holding the result of the
        analysis performed
    """
    # build a strided (window, step) view of the column and apply `function`
    x = _strided_app(data[col].to_numpy(), window, step)
    rolled = np.apply_along_axis(function, 1, x)

    if labels is None:
        labels = [f"metric_{i}" for i in range(rolled.shape[1])]

    for col in labels:
        data[col] = np.nan

    # write the results at the rows where each window ends
    data.loc[
        data.index[
            [False] * (window - 1)
            + list(np.arange(len(data) - (window - 1)) % step == 0)],
        labels] = rolled

    return data


def _strided_app(a, L, S):  # Window len = L, Stride len/stepsize = S
    """returns an array that is strided
    """
    nrows = ((a.size - L) // S) + 1
    n = a.strides[0]
    return np.lib.stride_tricks.as_strided(
        a, shape=(nrows, L), strides=(S * n, n))
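A minimal usage sketch for the two functions above (the column name "signal" and the per-window statistics are made up for illustration; the function returns two values per window so the rolled result is 2-D, matching the label handling):
import numpy as np
import pandas as pd

df = pd.DataFrame({"signal": np.arange(20, dtype=float)})

def window_stats(w):
    # mean and max of one window of raw values
    return np.array([w.mean(), w.max()])

# a 4-sample window, recomputed every 2 rows
out = apply_rolling_data(df, col="signal", function=window_stats,
                         window=4, step=2, labels=["roll_mean", "roll_max"])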
You can use rolling again; you just need a little bit of work when you assign the index.
Here by = 2:
by = 2
df.loc[df.index[np.arange(len(df))%by==1],'New']=df.Price.rolling(window=4).mean()
df
Price New
0 63 NaN
1 92 NaN
2 92 NaN
3 5 63.00
4 90 NaN
5 3 47.50
6 81 NaN
7 98 68.00
8 100 NaN
9 58 84.25
10 38 NaN
11 15 52.75
12 75 NaN
13 19 36.75
If the data size is not too large, here is an easy way:
by = 2
win = 4
start = 3 ## it is the index of your 1st valid value.
df.rolling(win).mean()[start::by] ## calculate all, choose what you need.
Now this is a bit of overkill for a 1D array of data, but you can simplify it and pull out what you need. Since pandas relies on numpy, you might want to check to see how their rolling/strided functions are implemented.
Results for 20 sequential numbers. A 7 day window, striding/sliding by 2
z = np.arange(20)
z #array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
s = stride(z, (7,), (2,))
np.mean(s, axis=1) # array([ 3., 5., 7., 9., 11., 13., 15.])
Here is the code I use, without the major portion of the documentation. It is derived from the many implementations of a strided function in numpy that can be found on this site. There are variants and incarnations; this is just another.
def stride(a, win=(3, 3), stepby=(1, 1)):
    """Provide a 2D sliding/moving view of an array.
    There is no edge correction for outputs. Use the `pad_` function first."""
    err = """Array shape, window and/or step size error.
    Use win=(3,) with stepby=(1,) for 1D array
    or win=(3,3) with stepby=(1,1) for 2D array
    or win=(1,3,3) with stepby=(1,1,1) for 3D
    ----    a.ndim != len(win) != len(stepby) ----
    """
    from numpy.lib.stride_tricks import as_strided
    a_ndim = a.ndim
    if isinstance(win, int):
        win = (win,) * a_ndim
    if isinstance(stepby, int):
        stepby = (stepby,) * a_ndim
    assert (a_ndim == len(win)) and (len(win) == len(stepby)), err
    shp = np.array(a.shape)    # array shape (r, c) or (d, r, c)
    win_shp = np.array(win)    # window      (3, 3) or (1, 3, 3)
    ss = np.array(stepby)      # step by     (1, 1) or (1, 1, 1)
    newshape = tuple(((shp - win_shp) // ss) + 1) + tuple(win_shp)
    newstrides = tuple(np.array(a.strides) * ss) + a.strides
    a_s = as_strided(a, shape=newshape, strides=newstrides, subok=True).squeeze()
    return a_s
I failed to point out that you can create an output that you can append as a column to a pandas DataFrame. Going back to the original definitions used above:
nans = np.full_like(z, np.nan, dtype='float') # z is the 20 number sequence
means = np.mean(s, axis=1) # results from the strided mean
# assign the means to the output array skipping the first and last 3 and striding by 2
nans[3:-3:2] = means
nans # array([nan, nan, nan, 3., nan, 5., nan, 7., nan, 9., nan, 11., nan, 13., nan, 15., nan, nan, nan, nan])
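To actually attach that as a column, one hedged option (assuming the 20-element z used above is your data) is:
import pandas as pd
out = pd.DataFrame({"z": z, "win7_step2_mean": nans})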
Since pandas 1.5.0, there is a step parameter to rolling() that should do the trick. See: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rolling.html
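A quick sketch of that parameter (requires pandas >= 1.5.0; Price is the column name from the earlier example):
result = df["Price"].rolling(window=4, step=2).mean()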
Using Pandas.asfreq() after rolling
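That is a terse suggestion, so here is a hedged sketch of what it could look like on the EDIT1 data (assuming DATE parses to a DatetimeIndex at 30-minute spacing, so keeping hourly points is equivalent to a step of 2):
tmp = df.set_index(pd.to_datetime(df["DATE"]))
two_hour_mean_hourly = tmp["A DEMAND"].rolling(4).mean().asfreq("1h")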

get index value of numpy.ndarrays and do math on it in a one line condition

Given the following numpy.ndarrays of identical length:
nparray_upper = [ 5.2 4.9 7.6 10.1]
nparray_base = [ 2.2 2.6 5.5 11.02]
nparray_lower = [ 4.3 1.4 3.2 8.9]
and a fixed-size variable
multiplier = 10
how do I multiply the index of each element by the multiplier, based on a condition?
indexMultiplierCondition = np.where(((nparray_base <= nparray_upper) & (nparray_base >= nparray_lower)), INDEX * multiplier, 0).sum()
the above should return
indexMultiplierCondition = 30
because only 2.6 and 5.5 in nparray_base are within the upper and lower levels, and the sum of their indices 1 and 2 multiplied by 10 is 30.
This should be as efficient as possible.
np.where returns a tuple.
So, you can retrieve the first element of the tuple (which is a np.ndarray) and multiply by a scalar value of your choice.
For example, with a = nparray_upper, b = nparray_base, c = nparray_lower and m = multiplier:
i = np.where((b <= a) & (c <= b))
(array([1, 2], dtype=int64),)
i[0] * m
array([10, 20], dtype=int64)
(i[0] * m).sum()
30
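Putting the same idea together in one line with the question's variable names:
indexMultiplierCondition = (np.where((nparray_base <= nparray_upper) & (nparray_base >= nparray_lower))[0] * multiplier).sum()  # 30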

Efficient, large-scale competition scoring in Python

Consider a large dataframe of scores S containing entries like the following. Each row represents a contest between a subset of the participants A, B, C and D.
A B C D
0.1 0.3 0.8 1
1 0.2 NaN NaN
0.7 NaN 2 0.5
NaN 4 0.6 0.8
The way to read the matrix above is: looking at the first row, the participant A scored 0.1 in that round, B scored 0.3, and so forth.
I need to build a triangular matrix C where C[X,Y] stores how much better participant X was than participant Y. More specifically, C[X,Y] would hold the mean % difference in score between X and Y.
From the example above:
C[A,B] = 100 * ((0.1 - 0.3)/0.3 + (1 - 0.2)/0.2) = 33%
My matrix S is huge, so I am hoping to take advantage of JIT (Numba?) or built-in methods in numpy or pandas. I certainly want to avoid having a nested loop, since S has millions of rows.
Does an efficient algorithm for the above have a name?
Let's look at a NumPy-based solution, and thus let's assume that the input data is in an array named a. Now, the number of pairwise combinations for 4 such variables would be 4*3/2 = 6. We can generate the IDs corresponding to such combinations with np.triu_indices(). Then, we index into the columns of a with those indices, perform the subtractions and divisions, and simply sum down each column, ignoring the NaN-affected entries with np.nansum(), for the desired output.
Thus, we would have an implementation like so -
R,C = np.triu_indices(a.shape[1],1)
out = 100*np.nansum((a[:,R] - a[:,C])/a[:,C],0)
Sample run -
In [121]: a
Out[121]:
array([[ 0.1, 0.3, 0.8, 1. ],
[ 1. , 0.2, nan, nan],
[ 0.7, nan, 2. , 0.5],
[ nan, 4. , 0.6, 0.8]])
In [122]: out
Out[122]:
array([ 333.33333333, -152.5 , -50. , 504.16666667,
330. , 255. ])
In [123]: 100 * ((0.1 - 0.3)/0.3 + (1 - 0.2)/0.2) # Sample's first o/p elem
Out[123]: 333.33333333333337
If you need the output as (4,4) array, we can use Scipy's squareform -
In [124]: from scipy.spatial.distance import squareform
In [125]: out2D = squareform(out)
Let's convert to a pandas dataframe for a good visual feedback -
In [126]: pd.DataFrame(out2D,index=list('ABCD'),columns=list('ABCD'))
Out[126]:
A B C D
A 0.000000 333.333333 -152.500000 -50
B 333.333333 0.000000 504.166667 330
C -152.500000 504.166667 0.000000 255
D -50.000000 330.000000 255.000000 0
Let's compute [B,C] manually and check back -
In [127]: 100 * ((0.3 - 0.8)/0.8 + (4 - 0.6)/0.6)
Out[127]: 504.1666666666667

Transform Pandas DataFrame with n-level hierarchical index into n-D Numpy array

Question
Is there a good way to transform a DataFrame with an n-level index into an n-D Numpy array (a.k.a n-tensor)?
Example
Suppose I set up a DataFrame like
from pandas import DataFrame, MultiIndex
index = range(2), range(3)
value = range(2 * 3)
frame = DataFrame(value, columns=['value'],
                  index=MultiIndex.from_product(index)).drop((1, 0))
print(frame)
which outputs
     value
0 0      0
  1      1
  2      2
1 1      4
  2      5
The index is a 2-level hierarchical index. I can extract a 2-D Numpy array from the data using
print(frame.unstack().values)
which outputs
[[  0.   1.   2.]
 [ nan   4.   5.]]
How does this generalize to an n-level index?
Playing with unstack(), it seems that it can only be used to massage the 2-D shape of the DataFrame, but not to add an axis.
I cannot use e.g. frame.values.reshape(x, y, z), since this would require that the frame contains exactly x * y * z rows, which cannot be guaranteed. This is what I tried to demonstrate by drop()ing a row in the above example.
Any suggestions are highly appreciated.
Edit. This approach is much more elegant (and two orders of magnitude faster) than the one I gave below.
import numpy as np

# create an empty array of NaN of the right dimensions
shape = list(map(len, frame.index.levels))
arr = np.full(shape, np.nan)

# fill it using Numpy's advanced indexing
arr[tuple(frame.index.codes)] = frame.values.flat

# ...or in Pandas < 0.24.0, use
# arr[tuple(frame.index.labels)] = frame.values.flat
Original solution. Given a setup similar to above, but in 3-D,
from pandas import DataFrame, MultiIndex
from itertools import product
index = range(2), range(2), range(2)
value = range(2 * 2 * 2)
frame = DataFrame(value, columns=['value'],
                  index=MultiIndex.from_product(index)).drop((1, 0, 1))
print(frame)
we have
       value
0 0 0      0
    1      1
  1 0      2
    1      3
1 0 0      4
  1 0      6
    1      7
Now, we proceed using the reshape() route, but with some preprocessing to ensure that the length along each dimension will be consistent.
First, reindex the data frame with the full cartesian product of all dimensions. NaN values will be inserted as needed. This operation can be both slow and consume a lot of memory, depending on the number of dimensions and on the size of the data frame.
levels = map(tuple, frame.index.levels)
index = list(product(*levels))
frame = frame.reindex(index)
print(frame)
which outputs
       value
0 0 0      0
    1      1
  1 0      2
    1      3
1 0 0      4
    1    NaN
  1 0      6
    1      7
Now, reshape() will work as intended.
shape = list(map(len, frame.index.levels))
print(frame.values.reshape(shape))
which outputs
[[[  0.   1.]
  [  2.   3.]]

 [[  4.  nan]
  [  6.   7.]]]
The (rather ugly) one-liner is
frame.reindex(list(product(*map(tuple, frame.index.levels)))).values\
     .reshape(list(map(len, frame.index.levels)))
This can be done quite nicely using the Python xarray package which can be found here: http://xarray.pydata.org/en/stable/. It has great integration with Pandas and is quite intuitive once you get to grips with it.
If you have a multiindex series you can call the built-in method multiindex_series.to_xarray() (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_xarray.html). This will generate a DataArray object, which is essentially a name-indexed numpy array, using the index values and names as coordinates. Following this you can call .values on the DataArray object to get the underlying numpy array.
If you need your tensor to conform to a set of keys in a specific order, you can also call .reindex(index_name = index_values_in_order) (http://xarray.pydata.org/en/stable/generated/xarray.DataArray.reindex.html) on the DataArray. This can be extremely useful and makes working with the newly generated tensor much easier!
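A minimal sketch of that route, reusing the 2-level example from earlier in this question (it requires the xarray package to be installed; to_xarray() fills the missing (1, 0) cell with NaN):
import pandas as pd

index = pd.MultiIndex.from_product([range(2), range(3)], names=["i", "j"])
series = pd.Series(range(6), index=index, dtype=float).drop([(1, 0)])

da = series.to_xarray()   # an xarray.DataArray with dims "i" and "j"
arr = da.values           # the underlying 2-D numpy array, NaN where data was missing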
