Storing upper/lower half of correlation matrix - python

What is the most (1) memory-efficient, (2) time-efficient, and (3) easy-to-access* way to store the upper/lower half of a correlation matrix to a file in Python?
(*By "easy-to-access" I mean being able to read the file back and plot the correlation matrix using matplotlib/seaborn.)
For example, for the correlation matrix below:
C1 C2 C3 C4
C1 1.0 0.6 0.7 0.5
C2 0.6 1.0 0.4 0.9
C3 0.7 0.4 1.0 0.3
C4 0.5 0.9 0.3 1.0
I want to store the below numbers to a file.
C2 C3 C4
C1 0.6 0.7 0.5
C2 0.4 0.9
C3 0.3
OR
C1 C2 C3
C2 0.6
C3 0.7 0.4
C4 0.5 0.9 0.3
(I thought of storing it as a CSV/TSV file, but it would still waste space on the blank fields left for the other half of the matrix.)

You need something like this:
import numpy as np

matrix = np.array([[1.0, 0.6, 0.7, 0.5],
                   [0.6, 1.0, 0.4, 0.9],
                   [0.7, 0.4, 1.0, 0.3],
                   [0.5, 0.9, 0.3, 1.0]])
ut = np.triu(matrix, k=1)   # strictly upper triangle, zeros elsewhere
lt = np.tril(matrix, k=-1)  # strictly lower triangle, zeros elsewhere
ut = np.where(ut == 0, np.nan, ut)
lt = np.where(lt == 0, np.nan, lt)
np.savetxt("upper.csv", ut, delimiter=",")
np.savetxt("lower.csv", lt, delimiter=",")

Use the second representation. It's just the transpose of the first, and you don't need to store any blank characters for the other half. If blank characters are your concern, write a custom file writer/reader for your matrix.
Example:
mat = []
mat.append(["C1", "C2", "C3"])
mat.append(["C2", 0.6])
mat.append(["C3", 0.7, 0.4])
mat.append(["C4", 0.5, 0.9, 0.3])
print(mat)
with open("correlation.txt", "w") as _file:
    for row in mat:
        _file.write("\t".join(str(val) for val in row))
        _file.write("\n")  # you will not have blank characters
with open("correlation.txt", "r") as _file:
    for line in _file.readlines():
        print(len(line.split()))
Result:
[['C1', 'C2', 'C3'], ['C2', 0.6], ['C3', 0.7, 0.4], ['C4', 0.5, 0.9, 0.3]]
3
2
3
4
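To satisfy the "easy-to-access" requirement, a minimal reader (a sketch assuming the tab-separated layout written above) can rebuild the full symmetric DataFrame and hand it to seaborn:
import numpy as np
import pandas as pd
import seaborn as sns

labels = ["C1", "C2", "C3", "C4"]
full = pd.DataFrame(np.eye(len(labels)), index=labels, columns=labels)

with open("correlation.txt") as _file:
    header = _file.readline().strip().split("\t")   # column labels: C1 C2 C3
    for line in _file:
        parts = line.rstrip("\n").split("\t")
        row = parts[0]                               # row label: C2, C3, C4
        for col, val in zip(header, parts[1:]):
            full.loc[row, col] = full.loc[col, row] = float(val)

sns.heatmap(full, annot=True)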

Related

Pandas efficiently add new column true/false if between two other columns

Using Pandas, how can I efficiently add a new column that is true/false if the value in one column (x) is between the values in two other columns (low and high)?
The np.select approach from here works perfectly, but I "feel" like there should be a one-liner way to do this.
Using Python 3.7
import numpy as np
import pandas as pd

fid = [0, 1, 2, 3, 4]
x = [0.18, 0.07, 0.11, 0.3, 0.33]
low = [0.1, 0.1, 0.1, 0.1, 0.1]
high = [0.2, 0.2, 0.2, 0.2, 0.2]
test = pd.DataFrame(data=zip(fid, x, low, high), columns=["fid", "x", "low", "high"])
conditions = [(test["x"] >= test["low"]) & (test["x"] <= test["high"])]
labels = ["True"]
test["between"] = np.select(conditions, labels, default="False")
display(test)
As mentioned by @Brebdan, you can use this built-in:
test["between"] = test["x"].between(test["low"], test["high"])
output:
fid x low high between
0 0 0.18 0.1 0.2 True
1 1 0.07 0.1 0.2 False
2 2 0.11 0.1 0.2 True
3 3 0.30 0.1 0.2 False
4 4 0.33 0.1 0.2 False
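If the endpoints matter, Series.between also takes an inclusive argument (the column name below is illustrative; the accepted values depend on your pandas version, with older releases taking a boolean and pandas >= 1.3 taking "both"/"neither"/"left"/"right"):
# strictly between low and high (newer pandas API)
test["strictly_between"] = test["x"].between(test["low"], test["high"], inclusive="neither")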

Averaging values with irregular time intervals

I have several pairs of arrays of measurements and the times at which the measurements were taken that I want to average. Unfortunately, the times at which these measurements were taken aren't regular or the same for each pair.
My idea for averaging them is to create a new array with the value at each second then average these. It works but it seems a bit clumsy and means I have to create many unnecessarily long arrays.
Example Inputs
m1 = [0.4, 0.6, 0.2]
t1 = [0.0, 2.4, 5.2]
m2 = [1.0, 1.4, 1.0]
t2 = [0.0, 3.6, 4.8]
Generated Regular Arrays for values at each second
r1 = [0.4, 0.4, 0.4, 0.6, 0.6, 0.6, 0.2]
r2 = [1.0, 1.0, 1.0, 1.0, 1.4, 1.0]
Average values up to length of shortest array
a = [0.7, 0.7, 0.7, 0.8, 1.0, 0.8]
My attempt, given a list of measurement arrays measurements and a corresponding list of time arrays times:
def granulate(values, times):
    count = 0
    regular_values = []
    for index, x in enumerate(times):
        while count <= x:
            regular_values.append(values[index])
            count += 1
    return np.array(regular_values)

processed_measurements = [granulate(m, t) for m, t in zip(measurements, times)]
min_length = min(len(m) for m in processed_measurements)
processed_measurements = [m[:min_length] for m in processed_measurements]
average_measurement = np.mean(processed_measurements, axis=0)
Is there a better way to do it, ideally using numpy functions?
This will average to the closest second (assuming m1, t1, m2, t2 are numpy arrays):
time_series = np.arange(np.stack((t1, t2)).max())
np.mean([m1[abs(t1 - time_series[:, None]).argmin(axis=1)],
         m2[abs(t2 - time_series[:, None]).argmin(axis=1)]], axis=0)
If you want to floor times to each second (with possibility of generalizing to more arrays):
m = [m1, m2]
t = [t1, t2]
m_t = []
time_series = np.arange(np.stack(t).max())
for i in range(len(t)):
    time_diff = time_series - t[i][:, None]
    m_t.append(m[i][np.where(time_diff > 0, time_diff, np.inf).argmin(axis=0)])
average = np.mean(m_t, axis=0)
output:
[0.7 0.7 0.7 0.8 1. 0.8]
You can do it this way (a bit more numpy-ish solution):
import numpy as np

# oddly enough, numpy doesn't have its own ffill function:
def np_ffill(arr):
    mask = np.arange(len(arr))
    mask[np.isnan(arr)] = 0
    np.maximum.accumulate(mask, axis=0, out=mask)
    return arr[mask]

t1 = np.ceil(t1).astype("int")
t2 = np.ceil(t2).astype("int")
r1 = np.empty(max(t1) + 1)
r2 = np.empty(max(t2) + 1)
r1[:] = np.nan
r2[:] = np.nan
r1[t1] = m1
r2[t2] = m2
r1 = np_ffill(r1)
r2 = np_ffill(r2)
>>> print(r1,r2)
[0.4 0.4 0.4 0.6 0.6 0.6 0.2] [1. 1. 1. 1. 1.4 1. ]
#in order to get avg:
r3=np.vstack([r1[:len(r2)],r2[:len(r1)]]).mean(axis=0)
>>> print(r3)
[0.7 0.7 0.7 0.8 1. 0.8]
I see two possible solutions:
1. Create a 'bucket' for each time step, let's say 1 second, and put all measurements that were taken at that time step +/- 1 second into the bucket, then average all values in the bucket.
2. Interpolate every measurement row so that they all have equal time steps, then average all measurements for every time step (a sketch of this is below).
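A minimal sketch of the interpolation idea, assuming linear interpolation between samples is acceptable (the answers above instead hold each value constant until the next sample):
import numpy as np

m1, t1 = np.array([0.4, 0.6, 0.2]), np.array([0.0, 2.4, 5.2])
m2, t2 = np.array([1.0, 1.4, 1.0]), np.array([0.0, 3.6, 4.8])

# common 1-second grid covering only the span where both series have data
grid = np.arange(0.0, min(t1.max(), t2.max()) + 1e-9, 1.0)

# np.interp linearly interpolates each series onto the shared grid
r1 = np.interp(grid, t1, m1)
r2 = np.interp(grid, t2, m2)
average = np.mean([r1, r2], axis=0)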

Sort list based on another list

I have two lists in Python 3.6, and I would like to sort w according to the values in d. This is similar to this question,
Sorting list based on values from another list?, though I could not use zip because w and d are not paired data.
I have a code sample, and want to get the t variable.
Updated
I could do it using a for loop. Is there any faster way?
import numpy as np
w = np.arange(0.0, 1.0, 0.1)
t = np.zeros(10)
d = np.array([3.1, 0.2, 5.3, 2.2, 4.9, 6.1, 7.7, 8.1, 1.3, 9.4])
ind = np.argsort(d)
print('w', w)
print('d', d)
for i in range(10):
    t[ind[i]] = w[i]
print('t', t)
#w [ 0.   0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9]
#d [ 3.1  0.2  5.3  2.2  4.9  6.1  7.7  8.1  1.3  9.4]
#t [ 0.3  0.   0.5  0.2  0.4  0.6  0.7  0.8  0.1  0.9]
Use argsort like so:
>>> t = np.empty_like(w)
>>> t[d.argsort()] = w
>>> t
array([0.3, 0. , 0.5, 0.2, 0.4, 0.6, 0.7, 0.8, 0.1, 0.9])
They are paired data, but in the opposite direction. One way to see it:
1. Make a third list, i, np.arange(0, 10).
2. zip this with d.
3. Sort the tuples with d as the sort key; i still holds the original index of each d element.
4. zip this with w.
5. Sort the triples (well, pairs with a pair as one element) with i as the sort key.
6. Extract the w values in their new order; this is your t array (see the sketch after this list).
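A minimal sketch of those steps with plain lists (the intermediate names by_d and by_i are mine):
w = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
d = [3.1, 0.2, 5.3, 2.2, 4.9, 6.1, 7.7, 8.1, 1.3, 9.4]

i = list(range(len(d)))                          # original positions of d
by_d = sorted(zip(d, i))                         # sort (d, i) pairs by d
by_i = sorted(zip((idx for _, idx in by_d), w))  # pair with w, then sort back by i
t = [val for _, val in by_i]                     # w values scattered into d's sorted order
print(t)  # [0.3, 0.0, 0.5, 0.2, 0.4, 0.6, 0.7, 0.8, 0.1, 0.9]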
The answers for this question are fantastic, but I feel it is prudent to point out that you are not doing what you think you are doing.
What you want to do (or at least what I gather): you want t to contain the values of w rearranged to be in the sorted order of d.
What you are doing: filling out t in the sorted order of d, with elements of w. You are only changing the order in which t gets filled up; you are not reflecting the sort of d into w on t.
Consider a small variation of your for loop:
for i in range(0, 10):
    t[i] = w[ind[i]]
This outputs a t
('t', array([0.1, 0.8, 0.3, 0. , 0.4, 0.2, 0.5, 0.6, 0.7, 0.9]))
You can just adapt PaulPanzer's answer to this as well.
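That is, if what you want is w rearranged into the sorted order of d, the argsort answer adapts to (a sketch):
t = w[d.argsort()]  # gather: w values in the order that sorts d
which reproduces the output just above.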

Differences between dataframe spearman correlation using pandas and scipy

I have a fairly big matrix (4780, 5460) and computed the Spearman correlation between rows using both "pandas.DataFrame.corr" and "scipy.stats.spearmanr". The two functions return very different correlation coefficients, and now I am not sure which is "correct", or whether my dataset is better suited to a different implementation.
Some context: the vectors (rows) I want to test for correlation do not necessarily have all the same points; there are NaNs in some columns and not in others.
df.T.corr(method='spearman')
(r, p) = spearmanr(df.T)
df2 = pd.DataFrame(index=df.index, columns=df.columns, data=r)
In[47]: df['320840_93602.563']
Out[47]:
320840_93602.563 1.000000
3254_642.148.peg.3256 0.565812
13752_42938.1206 0.877192
319002_93602.870 0.225530
328_642.148.peg.330 0.658269
...
12566_42938.19 0.818395
321125_93602.2882 0.535577
319185_93602.1135 0.678397
29724_39.3584 0.770453
321030_93602.1962 0.738722
Name: 320840_93602.563, dtype: float64
In[32]: df2['320840_93602.563']
Out[32]:
320840_93602.563 1.000000
3254_642.148.peg.3256 0.444675
13752_42938.1206 0.286933
319002_93602.870 0.225530
328_642.148.peg.330 0.606619
...
12566_42938.19 0.212265
321125_93602.2882 0.587409
319185_93602.1135 0.696172
29724_39.3584 0.097753
321030_93602.1962 0.163417
Name: 320840_93602.563, dtype: float64
scipy.stats.spearmanr is not designed to handle nan, and its behavior with nan values is undefined. [Update: scipy.stats.spearmanr now has the argument nan_policy.]
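For example, with a recent SciPy you can let spearmanr discard the NaN pairs itself (a sketch using two of the row labels from the question's output; note that pandas' corr drops NaNs pairwise for each pair of rows, so results can still differ when missing patterns vary):
from scipy.stats import spearmanr

# correlate two rows of df, discarding pairs where either value is NaN
rho, p = spearmanr(df.loc["320840_93602.563"],
                   df.loc["3254_642.148.peg.3256"],
                   nan_policy="omit")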
For data without nans, the functions appear to agree:
In [92]: np.random.seed(123)
In [93]: df = pd.DataFrame(np.random.randn(5, 5))
In [94]: df.T.corr(method='spearman')
Out[94]:
0 1 2 3 4
0 1.0 -0.8 0.8 0.7 0.1
1 -0.8 1.0 -0.7 -0.7 -0.1
2 0.8 -0.7 1.0 0.8 -0.1
3 0.7 -0.7 0.8 1.0 0.5
4 0.1 -0.1 -0.1 0.5 1.0
In [95]: rho, p = spearmanr(df.values.T)
In [96]: rho
Out[96]:
array([[ 1. , -0.8, 0.8, 0.7, 0.1],
[-0.8, 1. , -0.7, -0.7, -0.1],
[ 0.8, -0.7, 1. , 0.8, -0.1],
[ 0.7, -0.7, 0.8, 1. , 0.5],
[ 0.1, -0.1, -0.1, 0.5, 1. ]])

Probability tensor multiplication using pandas.DataFrame

I'm looking for a good way to store and use conditional probabilities in python.
I'm thinking of using a pandas dataframe. If the conditional probabilities of some X are P(X=A|P1=1, P2=1) = 0.2, P(X=B|P1=2, P2=1) = 0.9 etc., I would use the dataframe
       A    B
P1 P2
1  1   0.2  0.8
   2   0.5  0.5
2  1   0.9  0.1
   2   0.9  0.1
and given the marginal probabilities of P1 and P2 as Series
1 0.4
2 0.6
Name: P1
1 0.7
2 0.3
Name: P2
I would like to obtain the Series of marginal probabilities of X, i.e. the series
A 0.602
B 0.398
Name: X
I can get what I want by
X = sum(
    sum(
        X.xs(i, level="P1") * P1[i]
        for i in P1.index
    ).xs(j) * P2[j]
    for j in P2.index
)
X.name = "X"
but this is not easily generalizable to more dependencies, the asymmetry between the first xs (with level) and the second one (without) looks weird, and, as usual when working with pandas, I'm quite sure there is a better solution using its tricks and methods.
Is pandas a good tool for this, should I represent my data in another way, and what is the best way to do this calculation, which is essentially an indexed tensor product, in pandas?
One way to vectorize is to access the values in Series P1 and P2 by indexing with an array of labels.
In [20]: df = X.reset_index()
In [21]: mP1 = P1[df.P1].values
In [22]: mP2 = P2[df.P2].values
In [23]: mP1
Out[23]: array([ 0.4, 0.4, 0.6, 0.6])
In [24]: mP2
Out[24]: array([ 0.7, 0.3, 0.7, 0.3])
In [25]: mp = mP1 * mP2
In [26]: mp
Out[26]: array([ 0.28, 0.12, 0.42, 0.18])
In [27]: X.mul(mp, axis=0)
Out[27]:
          A      B
P1 P2
1  1   0.056  0.224
   2   0.060  0.060
2  1   0.378  0.042
   2   0.162  0.018
In [28]: X.mul(mp, axis=0).sum()
Out[28]:
A 0.656
B 0.344
In [29]: sum(
             sum(
                 X.xs(i, level="P1")*P1[i]
                 for i in P1.index
             ).xs(j)*P2[j]
             for j in P2.index
         )
Out[29]:
A 0.656
B 0.344
(Alternatively, you can access the values of a MultiIndex without resetting the index, as follows.)
In [38]: P1[X.index.get_level_values("P1")].values
Out[38]: array([ 0.4, 0.4, 0.6, 0.6])
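To generalize to any number of conditioning variables, the same trick can be wrapped in a small helper (a sketch; marginalize is a hypothetical name, and it assumes each marginal Series is keyed by the matching index level of X):
import numpy as np

def marginalize(X, marginals):
    # X: DataFrame of P(X=col | index levels), one column per outcome of X
    # marginals: dict mapping index level name -> Series of marginal probabilities
    weights = np.ones(len(X))
    for level, series in marginals.items():
        # look up the marginal probability of each row's label on this level
        weights *= series[X.index.get_level_values(level)].values
    out = X.mul(weights, axis=0).sum()
    out.name = "X"
    return out

# e.g. marginalize(X, {"P1": P1, "P2": P2}) reproduces the result above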
