I have a large dataset with thousands of rows (though fewer columns). I have ordered them by row values so that each of the 'objects' is grouped together, just like the dataset in Table1 below:
#Table1:
import pandas as pd

data = [['ALFA', 351740.00, 0.31, 0.22, 0.44, 0.19, 0.05],
['ALFA', 401740.00, 0.43, 0.26, 0.23, 0.16, 0.09],
['ALFA', 892350.00, 0.58, 0.24, 0.05, 0.07, 0.4],
['Bravo', 511830.00, 0.52, 0.16, 0.08, 0.26, 0],
['Charlie', 590030.00, 0.75, 0.2, 0.14, 0.37, 0.06],
['Charlie', 590030.00, 0.75, 0.2, 0.27, 0.2, 0.01],
['Charlie', 590030.00, 0.75, 0.2, 0.29, 0.11, 0.04],
['Charlie', 590030.00, 0.75, 0.2, 0.27, 0.2, 0.01],
['Charlie', 401740.00, 0.43, 0.26, 0.14, 0.37, 0.06],
['Charlie', 511830.00, 0.52, 0.16, 0.13, 0.22, 0.01],
['Delta', 590030.00, 0.75, 0.2, 0.34, 0.3, 0],
['Delta', 590030.00, 0.75, 0.2, 0, 0.28, 0],
['Delta', 351740.00, 0.31, 0.22, 0.44, 0.19, 0.05],
['Echo', 892350.00, 0.58, 0.24, 0.23, 0.16, 0.09],
['Echo', 590030.00, 0.75, 0.2, 0.05, 0.07, 0.4],
['Echo', 590030.00, 0.75, 0.2, 0.08, 0.26, 0],
['Echo', 590030.00, 0.75, 0.2, 0.14, 0.37, 0.06],
['Foxtrot', 401740.00, 0.43, 0.26, 0.27, 0.2, 0.01],
['Foxtrot', 511830.00, 0.52, 0.16, 0.29, 0.11, 0.04],
['Golf', 590030.00, 0.75, 0.2, 0.27, 0.2, 0.01],
['Golf', 590030.00, 0.75, 0.2, 0.14, 0.37, 0.06],
['Golf', 351740.00, 0.31, 0.22, 0.13, 0.22, 0.01],
['Hotel', 892350.00, 0.58, 0.24, 0.34, 0.3, 0],
['Hotel', 590030.00, 0.75, 0.2, 0, 0.28, 0],
['Hotel', 590030.00, 0.75, 0.2, 0.29, 0.11, 0.04]]
df = pd.DataFrame(data, columns=['Objects', 'Column1', 'Column2', 'Column3', 'Column4', 'Column5', 'Column6'])
df
However, I would like to partition the data by these objects and get the averages of all the columns for each object in a separate table, much like Table2 below:
#Table2:
data2 = [['ALFA', 548610.00, 0.44, 0.24, 0.24, 0.14, 0.18],
['Bravo', 511830.00, 0.52, 0.16, 0.08, 0.26, 0],
['Charlie', 545615.00, 0.66, 0.20, 0.21, 0.25, 0.03],
['Delta', 510600.00, 0.60, 0.21, 0.26, 0.26, 0.02],
['Echo', 665610.00, 0.71, 0.21, 0.13, 0.22, 0.14],
['Foxtrot', 456785.00, 0.48, 0.21, 0.28, 0.16, 0.03],
['Golf', 510600.00, 0.60, 0.21, 0.18, 0.26, 0.03],
['Hotel', 690803.33, 0.69, 0.21, 0.21, 0.23, 0.01]]
df2 = pd.DataFrame(data2, columns=['Objects', 'Column1', 'Column2', 'Column3', 'Column4', 'Column5', 'Column6'])
df2
Please note that the number of rows per object varies across the dataset, so the query should count the rows for each object and use that count to compute the average of every column for that object, then present these values in a new table.
For instance, the 548610.00 value for ALFA in Table2 (Column1) is simply the sum of ALFA's Column1 values in Table1 (351740.00 + 401740.00 + 892350.00) divided by the count of ALFA rows, which is 3.
Just use the groupby() function from pandas:
df.groupby('Objects').mean()
Other aggregation functions such as min() can be used instead of mean().
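For instance, agg() can apply several aggregations at once; a minimal sketch (the particular functions chosen here are just illustrative):
df.groupby('Objects').agg(['mean', 'min', 'max'])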
It can also be done in one chain:
df2 = pd.DataFrame(
data,
columns=['Objects', 'Column1', 'Column2', 'Column3', 'Column4', 'Column5', 'Column6']
).groupby(by='Objects').mean().round(decimals=2).reset_index()
print(df2)
Output:
Objects Column1 Column2 Column3 Column4 Column5 Column6
0 ALFA 548610.00 0.44 0.24 0.24 0.14 0.18
1 Bravo 511830.00 0.52 0.16 0.08 0.26 0.00
2 Charlie 545615.00 0.66 0.20 0.21 0.24 0.03
3 Delta 510600.00 0.60 0.21 0.26 0.26 0.02
4 Echo 665610.00 0.71 0.21 0.12 0.22 0.14
5 Foxtrot 456785.00 0.48 0.21 0.28 0.16 0.02
6 Golf 510600.00 0.60 0.21 0.18 0.26 0.03
7 Hotel 690803.33 0.69 0.21 0.21 0.23 0.01
I have a numpy 2d-array of shape (N, N) representing an NxN correlation matrix. This matrix is symmetric. Say N=5; then an example of this 2d-array would be:
x = np.array([[1.00, 0.46, 0.89, 0.76, 0.65],
[0.46, 1.00, 0.83, 0.88, 0.29],
[0.89, 0.83, 1.00, 0.57, 0.84],
[0.76, 0.88, 0.57, 1.00, 0.39],
[0.65, 0.29, 0.84, 0.39, 1.00]])
I would like to obtain P copies of x where the diagonal remains the same but the upper- and lower-triangular halves of the matrix are permuted in unison.
An example of one of these copies could be:
np.array([[1.00, 0.65, 0.89, 0.84, 0.39],
[0.65, 1.00, 0.76, 0.83, 0.88],
[0.89, 0.76, 1.00, 0.29, 0.57],
[0.84, 0.83, 0.29, 1.00, 0.46],
[0.39, 0.88, 0.57, 0.46, 1.00]])
It would be great if the solution doesn't take too long, as the matrix I am using is of shape (100, 100) and I would like to obtain 10,000-100,000 copies.
My intuition would be to somehow obtain the lower or upper half of the matrix as a flattened array, do the permutation, and replace the values in both the upper and lower halves. This, however, would take me a while to figure out, so I would like to know if there is a more straightforward approach. Thanks.
You can try this:
import numpy as np
x = np.array([[1.00, 0.46, 0.89, 0.76, 0.65],
[0.46, 1.00, 0.83, 0.88, 0.29],
[0.89, 0.83, 1.00, 0.57, 0.84],
[0.76, 0.88, 0.57, 1.00, 0.39],
[0.65, 0.29, 0.84, 0.39, 1.00]])
# build every (row, column) index pair of the matrix
j, i = np.meshgrid(np.arange(x.shape[0]), np.arange(x.shape[0]))
i, j = i.flatten(), j.flatten()
# keep only the strictly upper-triangular pairs
up_i, up_j = i[i < j], j[i < j]
elems = x[up_i, up_j]
np.random.shuffle(elems)  # shuffle the upper-triangular values in place
# write the shuffled values back into both halves symmetrically
x[up_i, up_j] = elems
x[up_j, up_i] = elems
x
array([[1. , 0.57, 0.88, 0.46, 0.76],
[0.57, 1. , 0.39, 0.29, 0.83],
[0.88, 0.39, 1. , 0.65, 0.89],
[0.46, 0.29, 0.65, 1. , 0.84],
[0.76, 0.83, 0.89, 0.84, 1. ]])
If all your xs have the same shape, you only need to call meshgrid and compute the upper-triangle indices once.
This uses NumPy fancy indexing to fetch the off-diagonal elements.
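Since many copies are needed, here is a minimal sketch building on the same idea (my own extension, not part of the original answer; np.triu_indices is an equivalent shortcut for the meshgrid step) that computes the indices once and fills P copies:
import numpy as np

def shuffled_copies(x, P, seed=None):
    rng = np.random.default_rng(seed)
    # strictly upper-triangular indices, computed once for all copies
    iu, ju = np.triu_indices(x.shape[0], k=1)
    copies = np.repeat(x[None, :, :], P, axis=0)
    for c in copies:
        perm = rng.permutation(x[iu, ju])  # shuffled copy of the upper half
        c[iu, ju] = perm                   # write into the upper triangle
        c[ju, iu] = perm                   # mirror into the lower triangle
    return copies

copies = shuffled_copies(x, P=10)  # every copy keeps the original diagonal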
I am trying to create a dictionary where the value for each key is two dictionaries.
I have two lists of patient (normal tissue, disease tissue) barcodes that correspond to columns of values in a dataframe. My goal is to match patients that are in both lists and then, for each patient found in both lists, append their normal and disease tissue values to a dictionary. The dictionary key would be the patient barcode and the dictionary value would be another dictionary of the normal tissue: values pulled from the dataframe and disease tissue: values pulled from the dataframe.
So starting with
In [3]: df = pd.DataFrame({'Patient1_Normal':['nan', 0.01, 0.1, 0.16, 0.88, 0.83, 0.82, 'nan'],
'Patient1_Disease':[0.12, 0.06, 0.19, 0.34, 'nan', 'nan', 0.73, 0.91],
'Patient2_Disease':['nan', 'nan', 'nan', 1.0, 0.24, 0.67, 0.97, 0.98],
'Patient3_Normal': [0.21, 0.25,0.63,0.92,0.3, 0.56, 0.78, 0.9],
'Patient3_Disease':[0.11, 0.45, 'nan', 0.45, 0.22, 0.89, 0.17, 0.12],
'Patient4_Normal':['nan', 0.35, 'nan', 0.22, 0.45, 0.66, 0.21, 0.91],
'Patient4_Disease':['nan', 'nan', 0.56, 0.72, 'nan', 0.97, 0.91, 0.79],
'Patient5_Disease': [0.34, 0.27, 'nan', 0.16, 0.32, 0.27, 0.55, 0.51]})
In [4]: df
Out[4]:
  Patient1_Normal Patient1_Disease Patient2_Disease Patient3_Normal Patient3_Disease Patient4_Normal Patient4_Disease Patient5_Disease
0 nan 0.12 nan 0.21 0.11 nan nan 0.34
1 0.01 0.06 nan 0.25 0.45 0.35 nan 0.27
2 0.1 0.19 nan 0.63 nan nan 0.56 nan
3 0.16 0.34 1 0.92 0.45 0.22 0.72 0.16
4 0.88 nan 0.24 0.30 0.22 0.45 nan 0.32
5 0.83 nan 0.67 0.56 0.89 0.66 0.97 0.27
6 0.82 0.73 0.97 0.78 0.17 0.21 0.91 0.55
7 nan 0.91 0.98 0.90 0.12 0.91 0.79 0.51
Here is what I have so far:
D_col = [col for col in df if '_Disease' in col]
N_col = [col for col in df if '_Normal' in col]
paired_patients = {}
psi_sets = {}
psi_sets['d'] = []
psi_sets['n'] = []
for patient in N_col:
    patient_id = patient[0:8]
    n_id = patient
    d_id = [i for i in D_col if patient_id in i]
    if len(d_id) > 0:
        psi_sets['n'] = df[n_id].to_list()
        for d in d_id:
            psi_sets['d'] = df[d].to_list()
        paired_patients[patient_id] = psi_sets
However, my paired_patients dictionary values are being overwritten instead of appended, so the output for paired_patients looks like this:
{'Patient1': {'d': ['nan', 'nan', 0.56, 0.72, 'nan', 0.97, 0.91, 0.79],
'n': ['nan', 0.35, 'nan', 0.22, 0.45, 0.66, 0.21, 0.91]},
'Patient3': {'d': ['nan', 'nan', 0.56, 0.72, 'nan', 0.97, 0.91, 0.79],
'n': ['nan', 0.35, 'nan', 0.22, 0.45, 0.66, 0.21, 0.91]},
'Patient4': {'d': ['nan', 'nan', 0.56, 0.72, 'nan', 0.97, 0.91, 0.79],
'n': ['nan', 0.35, 'nan', 0.22, 0.45, 0.66, 0.21, 0.91]}}
How do I fix the last bit of code so that the paired_patients dictionary values are filled in correctly for each patient, such that the dictionary looks like this:
{'Patient1': {'d': [0.12, 0.06, 0.19, 0.34, 'nan', 'nan', 0.73, 0.91],
'n': ['nan', 0.01, 0.1, 0.16, 0.88, 0.83, 0.82, 'nan']},
'Patient3': {'d': [0.11, 0.45, 'nan', 0.45, 0.22, 0.89, 0.17, 0.12],
'n': [0.21, 0.25,0.63,0.92,0.3, 0.56, 0.78, 0.9]},
'Patient4': {'d': ['nan', 'nan', 0.56, 0.72, 'nan', 0.97, 0.91, 0.79],
'n': ['nan', 0.35, 'nan', 0.22, 0.45, 0.66, 0.21, 0.91]}}
The fix is to create a fresh psi_sets dictionary inside the loop, so that each patient gets its own inner dictionary instead of every key sharing (and overwriting) the same object:
D_col = [col for col in df if '_Disease' in col]
N_col = [col for col in df if '_Normal' in col]
paired_patients = {}
for patient in N_col:
    psi_sets = {}
    patient_id = patient[0:8]
    n_id = patient
    d_id = [i for i in D_col if patient_id in i]
    if len(d_id) > 0:
        psi_sets['n'] = df[n_id].to_list()
        for d in d_id:
            psi_sets['d'] = df[d].to_list()
        paired_patients[patient_id] = psi_sets
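As an aside (a minimal illustration, not part of the original answer), the overwriting happened because every key in paired_patients pointed at the same psi_sets object, so each assignment mutated all entries at once:
shared = {}
d = {'k1': shared, 'k2': shared}  # both keys reference the same dict
shared['x'] = 1
# d == {'k1': {'x': 1}, 'k2': {'x': 1}} -- the mutation shows up under both keys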
You can use df.melt, pd.concat, Series.str.split, df.replace, df.groupby, and df.xs, and then finally df.to_dict.
Please check out the following:
>>> df2 = (pd.concat([
df.melt().variable.str.split('_', expand=True),
df.melt().drop('variable',1)
], axis=1)
.replace({'Normal':'n', 'Disease':'d'})
.groupby([0,1]).agg(list))
>>> paired_patients = {k: v for k, v in
df2.groupby(level=0)
.apply(lambda df: df.xs(df.name).value.to_dict())
.to_dict().items()
if not ({'d', 'n'} ^ v.keys())}
>>> paired_patients
{'Patient1': {'d': [0.12, 0.06, 0.19, 0.34, 'nan', 'nan', 0.73, 0.91],
'n': ['nan', 0.01, 0.1, 0.16, 0.88, 0.83, 0.82, 'nan']},
'Patient3': {'d': [0.11, 0.45, 'nan', 0.45, 0.22, 0.89, 0.17, 0.12],
'n': [0.21, 0.25,0.63,0.92,0.3, 0.56, 0.78, 0.9]},
'Patient4': {'d': ['nan', 'nan', 0.56, 0.72, 'nan', 0.97, 0.91, 0.79],
'n': ['nan', 0.35, 'nan', 0.22, 0.45, 0.66, 0.21, 0.91]}}
EXPLANATION:
>>> df.melt()
variable value
0 Patient1_Normal NaN
1 Patient1_Normal 0.01
2 Patient1_Normal 0.10
.. ... ...
62 Patient5_Disease 0.55
63 Patient5_Disease 0.51
>>> df.melt().variable.str.split('_', expand=True)
0 1
0 Patient1 Normal
1 Patient1 Normal
2 Patient1 Normal
.. ... ...
62 Patient5 Disease
63 Patient5 Disease
[64 rows x 2 columns]
# then concat these two, replace 'Normal' and 'Disease' with 'n' and 'd' and drop
# the 'variable' column
>>> pd.concat([
df.melt().variable.str.split('_', expand=True),
df.melt().drop('variable',1)
], axis=1).replace({'Normal':'n', 'Disease':'d'})
0 1 value
0 Patient1 n NaN
1 Patient1 n 0.01
2 Patient1 n 0.10
.. ... .. ...
62 Patient5 d 0.55
63 Patient5 d 0.51
[64 rows x 3 columns]
# then groupby column [0, 1] and aggregate into list:
>>> df2 = _.groupby([0,1]).agg(list)
>>> df2
value
0 1
Patient1 d [0.12, 0.06, 0.19, 0.34, nan, nan, 0.73, 0.91]
n [nan, 0.01, 0.1, 0.16, 0.88, 0.83, 0.82, nan]
Patient2 d [nan, nan, nan, 1.0, 0.24, 0.67, 0.97, 0.98]
Patient3 d [0.11, 0.45, nan, 0.45, 0.22, 0.89, 0.17, 0.12]
n [0.21, 0.25, 0.63, 0.92, 0.3, 0.56, 0.78, 0.9]
Patient4 d [nan, nan, 0.56, 0.72, nan, 0.97, 0.91, 0.79]
n [nan, 0.35, nan, 0.22, 0.45, 0.66, 0.21, 0.91]
Patient5 d [0.34, 0.27, nan, 0.16, 0.32, 0.27, 0.55, 0.51]
# Now groupby level=0, convert that into a dict, and finally keep only the
# patients that have both 'n' and 'd' keys (the membership test below is
# equivalent to checking that the symmetric difference {'d', 'n'} ^ v.keys()
# of the dict_keys object is empty)
>>> paired_patients = {k: v for k, v in
df2.groupby(level=0)
.apply(lambda df: df.xs(df.name).value.to_dict())
.to_dict().items()
if ('n' in v) and ('d' in v)}
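One caveat (my addition, not part of the original answer): newer pandas versions require the axis to be passed as a keyword, so df.melt().drop('variable', 1) should be written as:
df.melt().drop(columns='variable')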
I have two lists: data and given_x_axis
data=[[0.05, 3200], [0.1, 2000], [0.12, 1200], [0.13, 2000], [0.21, 1800], [0.25, 2800], [0.27, 1500]]
given_x_axis=[0.05, 0.07, 0.09, 0.11, 0.13, 0.15, 0.17, 0.19, 0.21, 0.23, 0.25, 0.27, 0.29, 0.31, 0.33, 0.35]
I want to plot a step chart with cumulative sums like this,
x,y=map(list, zip(*np.cumsum(data, axis=0)))
plt.step(x,y)
but using given_x_axis instead as the steps on x axis
I have tried to define a function that creates a new list of cumulative values based on given_x_axis:
def update_x_axis(data, given_x_axis):
    cumulated_values = []
    value_each_step = 0
    for n, x in enumerate(given_x_axis):
        for d in data:
            if d[0] <= x:
                value_each_step = value_each_step + d[1]
        cumulated_values.append(value_each_step)
    return [given_x_axis, cumulated_values]
But the new list of cumulative values on the y axis does not seem to be correct.
I expect update_x_axis(data, given_x_axis) to return
[[0.05, 0.07, 0.09, 0.11, 0.13, 0.15, 0.17, 0.19, 0.21, 0.23, 0.25, 0.27, 0.29, 0.31, 0.33, 0.35],
 [3200, 3200, 3200, 5200, 6400, 8400, ...]]
How can I modify my defined function to do this?
I might misunderstand the question or the desired outcome. What I think you're looking for is this:
import numpy as np
import matplotlib.pyplot as plt
data=[[0.05, 3200], [0.1, 2000], [0.12, 1200], [0.13, 2000], [0.21, 1800], [0.25, 2800], [0.27, 1500]]
given_x_axis=[0.05, 0.07, 0.09, 0.11, 0.13, 0.15, 0.17, 0.19, 0.21, 0.23, 0.25, 0.27, 0.29, 0.31, 0.33, 0.35]
# unpack into sorted x positions and their values
x, y = np.array(data).T
# for each requested step, find where it would fall among the data's x values
ind = np.searchsorted(x, given_x_axis, side="left")
ind[ind == 0] = 1            # clamp so the first step maps to the first value
res = np.cumsum(y)[ind - 1]  # cumulative sum up to each requested step
res is now
[ 3200. 3200. 3200. 5200. 6400. 8400. 8400. 8400. 8400. 10200.
10200. 13000. 14500. 14500. 14500. 14500.]
Then plotting,
fig, ax = plt.subplots()
ax.plot(x,np.cumsum(y), marker="o", ls="")
ax.step(given_x_axis, res)
plt.show()
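As an aside (my addition, not part of the original answer), step() also accepts a where argument controlling which side of each x value the step extends to; the default is 'pre', with 'post' and 'mid' as alternatives:
ax.step(given_x_axis, res, where='post')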
Say I have a data set of the following format
Date Value
AAAA_Property1 2015/07/22 0.01
AAAA_Property1 2015/07/23 0.02
.
.
.
AAAA_Property2 2015/07/22 0.88
AAAA_Property2 2015/07/23 0.80
.
.
.
.
BBBB_Property1 2015/07/22 0.04
BBBB_Property1 2015/07/23 0.07
.
.
.
BBBB_Property2 2015/07/22 0.72
BBBB_Property2 2015/07/23 0.70
.
.
.
.
As you can see, every AAAA, BBBB, CCCC has a number of properties (Property1, Property2, etc.) spread over time (so it's really a 'cubic' data set).
Now I am wondering how we can group things together and make the dataset look like the following:
Name Date Property1 Property2 . . .
AAAA 2015/07/22 0.01 0.3
AAAA 2015/07/23 0.02 0.4
.
.
.
BBBB 2015/07/22 0.02 0.4
BBBB 2015/07/23 0.09 0.7
.
.
.
I did some research on numpy and its ndarray, and I know that the reshape() method together with the where() method can probably be used to achieve the goal. But I couldn't put them together, since the data here is really 3-dimensional.
However, I did find something else that looks very similar to what I am trying to accomplish. Please take a look at the following data
data = [
['Sulfate', 'Nitrate', 'EC', 'OC1', 'OC2', 'OC3', 'OP', 'CO', 'O3'],
('Basecase', [
[0.88, 0.01, 0.03, 0.03, 0.00, 0.06, 0.01, 0.00, 0.00],
[0.07, 0.95, 0.04, 0.05, 0.00, 0.02, 0.01, 0.00, 0.00],
[0.01, 0.02, 0.85, 0.19, 0.05, 0.10, 0.00, 0.00, 0.00],
[0.02, 0.01, 0.07, 0.01, 0.21, 0.12, 0.98, 0.00, 0.00],
[0.01, 0.01, 0.02, 0.71, 0.74, 0.70, 0.00, 0.00, 0.00]]),
('With CO', [
[0.88, 0.02, 0.02, 0.02, 0.00, 0.05, 0.00, 0.05, 0.00],
[0.08, 0.94, 0.04, 0.02, 0.00, 0.01, 0.12, 0.04, 0.00],
[0.01, 0.01, 0.79, 0.10, 0.00, 0.05, 0.00, 0.31, 0.00],
[0.00, 0.02, 0.03, 0.38, 0.31, 0.31, 0.00, 0.59, 0.00],
[0.02, 0.02, 0.11, 0.47, 0.69, 0.58, 0.88, 0.00, 0.00]]),
('With O3', [
[0.89, 0.01, 0.07, 0.00, 0.00, 0.05, 0.00, 0.00, 0.03],
[0.07, 0.95, 0.05, 0.04, 0.00, 0.02, 0.12, 0.00, 0.00],
[0.01, 0.02, 0.86, 0.27, 0.16, 0.19, 0.00, 0.00, 0.00],
[0.01, 0.03, 0.00, 0.32, 0.29, 0.27, 0.00, 0.00, 0.95],
[0.02, 0.00, 0.03, 0.37, 0.56, 0.47, 0.87, 0.00, 0.00]]),
('CO & O3', [
[0.87, 0.01, 0.08, 0.00, 0.00, 0.04, 0.00, 0.00, 0.01],
[0.09, 0.95, 0.02, 0.03, 0.00, 0.01, 0.13, 0.06, 0.00],
[0.01, 0.02, 0.71, 0.24, 0.13, 0.16, 0.00, 0.50, 0.00],
[0.01, 0.03, 0.00, 0.28, 0.24, 0.23, 0.00, 0.44, 0.88],
[0.02, 0.00, 0.18, 0.45, 0.64, 0.55, 0.86, 0.00, 0.16]])
]
I did a help(data), and it turns out to be a list object. But I am struggling with (1) how to get my data into the above format, and (2) how to extract subset matrices from the above format and perform matrix operations from there.
Here is some mock-up code that I have come up with:
import pandas as pd
import numpy as np

df = pd.read_excel('xxx.xlsx', sheetname='1')
myMatrix = df.as_matrix()
myMatrix = myMatrix[:, :].astype(float)
# parse the first column and separate the AAAAs from the Property's
...
for current in ['AAAA', 'BBBB', ...]:
    mySubMatrix = myMatrix.where(firstColumn == current)
    mySubMatrix = mySubMatrix.reshape(numberOfDates, numberOfProperties)
    # then append mySubMatrix to the new target matrix
I don't expect this code to run ... but that is the algorithm I could think of
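For what it's worth, here is a minimal sketch of the reshaping step in pandas (my own suggestion, not from the original post; the toy column names are assumptions based on the sample above): split the labels on '_' into name and property, then pivot the properties into columns.
import pandas as pd

# toy frame mimicking the long format shown above
df = pd.DataFrame({
    'Label': ['AAAA_Property1', 'AAAA_Property1', 'AAAA_Property2', 'AAAA_Property2'],
    'Date': ['2015/07/22', '2015/07/23', '2015/07/22', '2015/07/23'],
    'Value': [0.01, 0.02, 0.88, 0.80],
})
df[['Name', 'Property']] = df['Label'].str.split('_', expand=True)
wide = df.pivot_table(index=['Name', 'Date'], columns='Property',
                      values='Value').reset_index()
# wide has one row per (Name, Date) with Property1, Property2, ... as columns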