How to reshape/"topple" a pandas dataframe - python

Topple is most likely the wrong name for the operation I want, but I cannot think of a better one.
I have N dataframes of shape (100, 3). Each row of the original dataframe contains the name of a test and the two results it produces. I want to reshape a single dataframe to a (1, 200) shape, with all of the test values in a single row. After that I'll append all of the N dataframes into a single one, ending up with an (N, 200) dataframe.
Here's an example with dummy data:
import pandas as pd
import numpy as np
import random
import string
np.random.seed(42)
tests = np.random.choice(list(string.ascii_letters),size=(100,1))
results = np.random.randint(0,100,size=(100, 2))
df = pd.DataFrame(np.concatenate([tests, results], axis=1), columns=["Test Name", "ValueA", "ValueB"])
# Mock-ups of the desired (1, 200) output: one row, one "<Test> <Value>" column per result
toppled_df = pd.DataFrame(np.random.randint(0, 100, size=(1, 5)),
                          columns=["Z ValueA", "Z ValueB", "t ValueA", "t ValueB", "..."])
toppled_df = pd.DataFrame([[44, 64, 88, 70, "..."]],
                          columns=["M ValueA", "M ValueB", "Z ValueA", "Z ValueB", "..."])
toppled_df.head()

A more pythonic way
# Stack each (test name, value type) pair into a MultiIndex, then transpose to a single row
df_out = df.set_index('Test Name').stack().to_frame().T
# Flatten the MultiIndex columns into "TestName ValueX" strings
df_out.columns = df_out.columns.map(' '.join).str.strip()
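To get from there to the (N, 200) frame described in the question, a minimal sketch could look like the following, assuming the N original dataframes are collected in a hypothetical list called frames:
rows = []
for frame in frames:  # `frames` is assumed to hold the N (100, 3) dataframes
    row = frame.set_index('Test Name').stack().to_frame().T
    row.columns = row.columns.map(' '.join).str.strip()
    rows.append(row)
combined = pd.concat(rows, ignore_index=True)  # shape (N, 200)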

You can melt df to a long format, join the test name and value type columns, then transpose it.
df2 = df.melt(id_vars='Test Name')
df2['key'] = df2['Test Name'] + ' ' + df2['variable']
df2[['key', 'value']].set_index('key').T
Loop through each dataframe to create the melted dataframe and then concatenate.
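A rough sketch of that loop-and-concatenate step, again assuming the N frames live in a hypothetical list called frames:
melted_rows = []
for frame in frames:
    m = frame.melt(id_vars='Test Name')
    m['key'] = m['Test Name'] + ' ' + m['variable']
    melted_rows.append(m[['key', 'value']].set_index('key').T)
combined = pd.concat(melted_rows, ignore_index=True)  # one row per original dataframe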

# unstack() turns the frame into a Series with a (value column, test name) MultiIndex
df2 = df.set_index('Test Name').unstack()
# Lay the values out as a single row; the MultiIndex becomes the columns
result = pd.DataFrame(data=df2.values.reshape(1,-1), columns=df2.index)
Output:
>>> result
ValueA ... ValueB
Test Name M Z C o Q h u M s w k k x J ... w N u p S r U x z y S O C o
0 44 88 8 0 87 10 7 34 4 27 72 11 32 22 ... 49 30 41 6 89 1 47 68 31 98 47 2 23 32
You can access individual results like this:
result['ValueA', 'M']
# or
result['ValueA']['M']
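If single-level column names like "M ValueA" (as in the question's mock-up) are preferred over the MultiIndex, the columns can be flattened afterwards. A small sketch, assuming the result frame from above:
# Flatten the (value column, test name) MultiIndex into "TestName ValueX" labels
flat = result.copy()
flat.columns = ['{} {}'.format(name, val) for val, name in flat.columns]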

Converting pandas.core.series.Series to dataframe with multiple column names

My toy example is as follows:
import numpy as np
from sklearn.datasets import load_iris
import pandas as pd
### prepare data
Xy = np.c_[load_iris(return_X_y=True)]
mycol = ['x1','x2','x3','x4','group']
df = pd.DataFrame(data=Xy, columns=mycol)
dat = df.iloc[:100,:] #only consider two species
dat['group'] = dat.group.apply(lambda x: 1 if x ==0 else 2) #two species means two groups
dat.shape
dat.head()
### Linear discriminant analysis procedure
G1 = dat.iloc[:50,:-1]; x1_bar = G1.mean(); S1 = G1.cov(); n1 = G1.shape[0]
G2 = dat.iloc[50:,:-1]; x2_bar = G2.mean(); S2 = G2.cov(); n2 = G2.shape[0]
Sp = (n1-1)/(n1+n2-2)*S1 + (n2-1)/(n1+n2-2)*S2
a = np.linalg.inv(Sp).dot(x1_bar-x2_bar); u_bar = (x1_bar + x2_bar)/2
m = a.T.dot(u_bar); print("Linear discriminant boundary is {} ".format(m))
def my_lda(x):
    y = a.T.dot(x)
    pred = 1 if y >= m else 2
    return y.round(4), pred
xx = dat.iloc[:,:-1]
xxa = xx.agg(my_lda, axis=1)
xxa.shape
type(xxa)
Here xxa is a pandas.core.series.Series with shape (100,). Note that each element of xxa holds two values in parentheses (a tuple). I want to convert xxa to a pd.DataFrame with 100 rows x 2 columns, and I try
xxa_df1 = pd.DataFrame(data=xxa, columns=['y','pred'])
which gives ValueError: Shape of passed values is (100, 1), indices imply (100, 2).
Then I continue to try
xxa2 = xxa.to_frame()
# xxa2 = pd.DataFrame(xxa) #equals `xxa.to_frame()`
xxa_df2 = pd.DataFrame(data=xxa2, columns=['y','pred'])
and xxa_df2 comes out as all NaN with 100 rows x 2 columns. What should I do next?
Let's try Series.tolist()
xxa_df1 = pd.DataFrame(data=xxa.tolist(), columns=['y','pred'])
print(xxa_df1)
y pred
0 42.0080 1
1 32.3859 1
2 37.5566 1
3 31.0958 1
4 43.5050 1
.. ... ...
95 -56.9613 2
96 -61.8481 2
97 -62.4983 2
98 -38.6006 2
99 -61.4737 2
[100 rows x 2 columns]
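An alternative that keeps the original index is to expand the tuples with apply (a sketch, assuming each element of xxa is a (y, pred) tuple as above; tolist() is usually faster):
xxa_df2 = xxa.apply(pd.Series)   # expand each tuple into two columns
xxa_df2.columns = ['y', 'pred']  # name the columns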

Evaluating an Expression using data frames

I have a df
Patient ID
A 72
A SD75
A 74
A 74
B 71
C 72
And I have an expression:
exp = '(((71+72)*((73+75)+SD75))*((74+76)+SD76))'
Now I need to evaluate this expression for each of the three patients A, B, and C, substituting a one where the df has a matching ID and a zero otherwise. A matches IDs 72, SD75 and 74, so the expressions should be
A- '(((0+1)*((0+0)+1))*((1+0)+0))'
B- '(((1+0)*((0+0)+0))*((0+0)+0))'
C- '(((0+1)*((0+0)+0))*((0+0)+0))'
And my final df_output should look like this:
Patient FinalVal
A 1
B 0
C 0
The FinalVal can be obtained by eval(exp) after replacing the IDs with 1s and 0s.
So far, here is where I've got to. When I replace the ID 75 with 0, the SD75 becomes SD0, and that's where I am stuck:
import pandas as pd
import re

exp = '(((71+72)*((73+75)+SD75))*((74+76)+SD76))'
mylist = re.sub(r'[^\w]', ' ', exp).split()
distinct_pt = df.Patient.drop_duplicates().dropna()
df_output = pd.DataFrame(distinct_pt)
df_output['Exp'] = exp
for index, row in df_output.iterrows():
    new_df = df[df.Patient == row['Patient']]
    new_dfl = new_df['ID'].tolist()
    #print(new_dfl)
    for j in mylist:
        if j in new_dfl:
            #print(j)
            row['Exp'] = row['Exp'].replace(j, '1')
        else:
            row['Exp'] = row['Exp'].replace(j, '0')
We can try creating an indicator DataFrame using Series.str.get_dummies to create indicator columns for each value in the ID column, then reduce to a single row per Patient via groupby max:
# Convert the ID column to binary indicator columns
indicator_df = df.set_index('Patient')['ID'].str.get_dummies()
# Reduce to 1 row per Patient
indicator_df = indicator_df.groupby(level=0).max()
indicator_df:
71 72 74 SD75
Patient
A 0 1 1 1
B 1 0 0 0
C 0 1 0 0
Now we can reindex from the expression terms to create missing columns. np.unique is used to ensure that duplicate terms in the expression do not result in duplicate columns in indicator_df (this can be omitted if it is guaranteed there are no duplicate terms):
exp = '(((71+72)*((73+75)+SD75))*((74+76)+SD76))'
# Extract terms from the expression
cols = re.sub(r'[^\w]', ' ', exp).split()
# Convert the ID column to binary indicator columns
indicator_df = df.set_index('Patient')['ID'].str.get_dummies()
# Reduce to 1 row per Patient
indicator_df = indicator_df.groupby(level=0).max()
# Ensure all expression terms are present
indicator_df = indicator_df.reindex(
    columns=np.unique(cols),  # prevent duplicate cols
    fill_value=0              # added cols are filled with 0
)
indicator_df:
71 72 73 74 75 76 SD75 SD76
Patient
A 0 1 0 1 0 0 1 0
B 1 0 0 0 0 0 0 0
C 0 1 0 0 0 0 0 0
Now if we alter exp slightly by surrounding these column names with backticks (`) we can use DataFrame.eval to compute the expression:
exp = '(((71+72)*((73+75)+SD75))*((74+76)+SD76))'
# Extract terms from the expression
cols = re.sub(r'[^\w]', ' ', exp).split()
# Create indicator_df (chained)
indicator_df = (
    df.set_index('Patient')['ID']
      .str.get_dummies()
      .groupby(level=0).max()
      .reindex(columns=np.unique(cols), fill_value=0)
)
# Eval the expression and create the resulting DataFrame
result = indicator_df.eval(
    # Add backticks around column names
    re.sub(r'(\w+)', r'`\1`', exp)
).reset_index(name='FinalVal')
result:
Patient FinalVal
0 A 1
1 B 0
2 C 0
The backticks are necessary to indicate these values represent column names, and not numeric values:
re.sub(r'(\w+)', r'`\1`', exp)
# (((`71`+`72`)*((`73`+`75`)+`SD75`))*((`74`+`76`)+`SD76`))
Notice the difference between 71 with backticks vs without:
# Column '71' + the number 71
pd.DataFrame({'71': [1, 2, 3]}).eval('B = `71` + 71')
71 B
0 1 72
1 2 73
2 3 74
Alternatively, the indicator_df can be created with a crosstab and clip:
exp = '(((71+72)*((73+75)+SD75))*((74+76)+SD76))'
# Extract terms from the expression
cols = re.sub(r'[^\w]', ' ', exp).split()
indicator_df = (
    pd.crosstab(df['Patient'], df['ID'])
      .clip(upper=1)  # Restrict the upper bound to 1
      .reindex(columns=np.unique(cols), fill_value=0)
)
# Eval the expression and create the resulting DataFrame
result = indicator_df.eval(
    # Add backticks around column names
    re.sub(r'(\w+)', r'`\1`', exp)
).reset_index(name='FinalVal')
Setup and imports used:
import re
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'Patient': ['A', 'A', 'A', 'A', 'B', 'C'],
    'ID': ['72', 'SD75', '74', '74', '71', '72']
})
I would not try to parse that expression and evaluate it. Instead, I would create dummy or indicator variables for the ID column. (Indicator variables are also called one-hot encoded variables.) With these indicators, you can then calculate your expression using a standard function.
Here's how to do it with pandas and scikit-learn. I am using scikit-learn's OneHotEncoder. An alternative might be pandas' get_dummies(), but OneHotEncoder allows you to specify the categories.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

variables = [71, 72, 73, 74, 75, 76, "SD75", "SD76"]
enc = OneHotEncoder(categories=[variables], sparse=False)
df = pd.DataFrame({
    "Patient": ["A"] * 4 + ["B", "C"],
    "ID": [72, "SD75", 74, 74, 71, 72]
})
# Create one-hot encoded variables, also called dummy or indicator variables
df_one_hot = pd.DataFrame(
    enc.fit_transform(df[["ID"]]),
    columns=variables,
    index=df.Patient
)
# Aggregate the dummy/one-hot variables so there is one row per patient.
# You may need to alter the aggregation function. I chose max because it
# matched your example, but perhaps sum might be better (e.g. patient A has
# two entries for 74; should that be a value of 2 for variable 74?).
one_hot_patient = df_one_hot.groupby(level="Patient").agg(max)
# Finally, evaluate your expression.
# Create a function to calculate the output given a data frame
def my_expr(DF):
    out = (DF[71] + DF[72]) \
        * (DF[73] + DF[75] + DF["SD75"]) \
        * (DF[74] + DF[76] + DF["SD76"])
    return out
output = one_hot_patient.assign(FinalVal=my_expr)
Result
71 72 73 74 75 76 SD75 SD76 FinalVal
Patient
A 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0
B 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
C 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
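If only the Patient and FinalVal columns from the question's desired output are needed, a final selection along these lines should work (a sketch based on the output frame above):
df_output = output.reset_index()[['Patient', 'FinalVal']]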
Using sub instead of replace should work:
for j in mylist:
    if j in new_dfl:
        exp = re.sub(r'\b{}'.format(j), '1', exp)
    else:
        exp = re.sub(r'\b{}'.format(j), '0', exp)
Another way that would work for this exact scenario is to sort mylist in descending order so that the items prefixed with SD are iterated over before the others.
mylist = re.sub(r'[^\w]', ' ', exp).split()
mylist.sort(reverse=True)
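Putting that together with the question's per-patient loop, a rough end-to-end sketch (assuming the string-typed df from the setup above, and using word boundaries on both sides of each term so that 75 is never matched inside SD75) might look like:
import re
import pandas as pd

exp = '(((71+72)*((73+75)+SD75))*((74+76)+SD76))'
mylist = re.sub(r'[^\w]', ' ', exp).split()

rows = []
for patient, grp in df.groupby('Patient'):
    ids = set(grp['ID'])
    patient_exp = exp
    for term in sorted(set(mylist), reverse=True):  # SD* terms sort first
        repl = '1' if term in ids else '0'
        patient_exp = re.sub(r'\b{}\b'.format(term), repl, patient_exp)
    rows.append({'Patient': patient, 'FinalVal': eval(patient_exp)})
df_output = pd.DataFrame(rows)  # Patient / FinalVal, matching the desired output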

How to loop through a pandas dataframe to run an independent ttest for each of the variables?

I have a dataset that consists of around 33 variables. The dataset contains patient information and the outcome of interest is binary in nature. Below is a snippet of the data.
The dataset is stored as a pandas dataframe
df.head()
ID Age GAD PHQ Outcome
1 23 17 23 1
2 54 19 21 1
3 61 23 19 0
4 63 16 13 1
5 37 14 8 0
I want to run independent t-tests looking at the differences in patient information based on outcome. So, if I were to run a t-test for each alone, I would do:
age_neg_outcome = df.loc[df.Outcome == 0, ['Age']]
age_pos_outcome = df.loc[df.Outcome == 1, ['Age']]
t_age, p_age = stats.ttest_ind(age_neg_outcome, age_pos_outcome, unequal = True)
print('\t Age: t= ', t_age, 'with p-value= ', p_age)
How can I do this in a for loop for each of the variables?
I've seen this post which is slightly similar but couldn't manage to use it.
Python : T test ind looping over columns of df
You are almost there. ttest_ind accepts multi-dimensional arrays too:
from scipy import stats

cols = ['Age', 'GAD', 'PHQ']
cond = df['Outcome'] == 0
neg_outcome = df.loc[cond, cols]
pos_outcome = df.loc[~cond, cols]
# The `unequal` parameter is invalid, so I'm leaving it out
# (for Welch's t-test, pass equal_var=False instead)
t, p = stats.ttest_ind(neg_outcome, pos_outcome)
for i, col in enumerate(cols):
    print(f'\t{col}: t = {t[i]:.5f}, with p-value = {p[i]:.5f}')
Output:
Age: t = 0.12950, with p-value = 0.90515
GAD: t = 0.32937, with p-value = 0.76353
PHQ: t = -0.96683, with p-value = 0.40495
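To keep the results around rather than just printing them, one option (a small sketch reusing t, p and cols from above, with pandas imported as pd) is to collect them into a summary DataFrame:
summary = pd.DataFrame({'t': t, 'p': p}, index=cols)
print(summary.round(5))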

How to check correlation between matching columns of two data sets?

If we have the data set:
import pandas as pd
a = pd.DataFrame({"A":[34,12,78,84,26], "B":[54,87,35,25,82], "C":[56,78,0,14,13], "D":[0,23,72,56,14], "E":[78,12,31,0,34]})
b = pd.DataFrame({"A":[45,24,65,65,65], "B":[45,87,65,52,12], "C":[98,52,32,32,12], "D":[0,23,1,365,53], "E":[24,12,65,3,65]})
How does one create a correlation matrix, in which the y-axis represents "a" and the x-axis represents "b"?
The aim is to see the correlations between the matching columns of the two datasets.
If you don't mind a NumPy-based vectorized solution, here is one based on this solution post to Computing the correlation coefficient between two multi-dimensional arrays:
corr2_coeff(a.values.T,b.values.T).T # func from linked solution post.
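For reference, a minimal sketch of what that corr2_coeff helper typically looks like (adapted from the linked post; treat the exact definition as an assumption here):
def corr2_coeff(A, B):
    # Row-wise mean-center both inputs
    A_mA = A - A.mean(1)[:, None]
    B_mB = B - B.mean(1)[:, None]
    # Sum of squares across rows
    ssA = (A_mA ** 2).sum(1)
    ssB = (B_mB ** 2).sum(1)
    # Pairwise Pearson correlation between rows of A and rows of B
    return np.dot(A_mA, B_mB.T) / np.sqrt(np.dot(ssA[:, None], ssB[None]))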
Sample run -
In [621]: a
Out[621]:
A B C D E
0 34 54 56 0 78
1 12 87 78 23 12
2 78 35 0 72 31
3 84 25 14 56 0
4 26 82 13 14 34
In [622]: b
Out[622]:
A B C D E
0 45 45 98 0 24
1 24 87 52 23 12
2 65 65 32 1 65
3 65 52 32 365 3
4 65 12 12 53 65
In [623]: corr2_coeff(a.values.T,b.values.T).T
Out[623]:
array([[ 0.71318502, -0.5923714 , -0.9704441 , 0.48775228, -0.07401011],
[ 0.0306753 , -0.0705457 , 0.48801177, 0.34685977, -0.33942737],
[-0.26626431, -0.01983468, 0.66110713, -0.50872017, 0.68350413],
[ 0.58095645, -0.55231196, -0.32053858, 0.38416478, -0.62403866],
[ 0.01652716, 0.14000468, -0.58238879, 0.12936016, 0.28602349]])
This achieves exactly what you want:
from scipy.stats import pearsonr

# Create a new DataFrame where the values for the indices and columns
# align on the diagonals
c = pd.DataFrame(columns=a.columns, index=a.columns)
# Since we know set(a.columns) == set(b.columns), we can just iterate through
# the columns in a (a more robust way would be to iterate through the
# intersection of the two sets of columns, in case your actual dataframes'
# columns don't match up).
for col in a.columns:
    correl_signif = pearsonr(a[col], b[col])  # correlation of those two Series
    correl = correl_signif[0]                 # grab the Pearson R value from the tuple
    c.loc[col, col] = correl                  # assign it to the diagonal for that column
Edit: Well, it achieved exactly what you wanted, until the question was modified. Although this can easily be changed:
c = pd.DataFrame(columns = a.columns, index = a.columns)
for col in c.columns:
    for idx in c.index:
        correl_signif = pearsonr(a[col], b[idx])
        correl = correl_signif[0]
        c.loc[idx, col] = correl
c is now this:
Out[16]:
A B C D E
A 0.713185 -0.592371 -0.970444 0.487752 -0.0740101
B 0.0306753 -0.0705457 0.488012 0.34686 -0.339427
C -0.266264 -0.0198347 0.661107 -0.50872 0.683504
D 0.580956 -0.552312 -0.320539 0.384165 -0.624039
E 0.0165272 0.140005 -0.582389 0.12936 0.286023
I use this function, which breaks it down with numpy:
def corr_ab(a, b):
    a_ = a.values
    b_ = b.values
    ab = a_.T.dot(b_)
    n = len(a)
    sums_squared = np.outer(a_.sum(0), b_.sum(0))
    stds_squared = np.outer(a_.std(0), b_.std(0))
    return pd.DataFrame((ab - sums_squared / n) / stds_squared / n,
                        a.columns, b.columns)
demo
corr_ab(a, b)
Do you have to use pandas? This seems like it can be done via numpy rather easily. Did I understand the task incorrectly?
import numpy
X = {"A":[34,12,78,84,26], "B":[54,87,35,25,82], "C":[56,78,0,14,13], "D":[0,23,72,56,14], "E":[78,12,31,0,34]}
Y = {"A":[45,24,65,65,65], "B":[45,87,65,52,12], "C":[98,52,32,32,12], "D":[0,23,1,365,53], "E":[24,12,65,3,65]}
for key, value in X.items():
    print("correlation stats for %s is %s" % (key, numpy.corrcoef(value, Y[key])))

Python: fast subsetting and looping dataframe

I have the following minimal code, which is too slow. For the 1000 rows I need, it takes about 2 min. I need it to run faster.
import time
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0,1000,size=(1000, 4)), columns=list('ABCD'))
start_algorithm = time.time()
myunique = df['D'].unique()
for i in myunique:
    itemp = df[df['D'] == i]
    for j in myunique:
        jtemp = df[df['D'] == j]
I know that numpy can make it run much faster but keep in mind that I want to keep a part of the original dataframe (or array in numpy) for specific values of column 'D'. How can I improve its performance?
Avoid computing the sub-DataFrame df[df['D'] == i] more than once. The original code computes this len(myunique)**2 times. Instead you can compute this once for each i (that is, len(myunique) times in total), store the results, and then pair them together later. For example,
groups = [grp for di, grp in df.groupby('D')]
for itemp, jtemp in IT.product(groups, repeat=2):
    pass
import numpy as np
import pandas as pd
import itertools as IT

df = pd.DataFrame(np.random.randint(0,1000,size=(1000, 4)), columns=list('ABCD'))

def using_orig():
    myunique = df['D'].unique()
    for i in myunique:
        itemp = df[df['D'] == i]
        for j in myunique:
            jtemp = df[df['D'] == j]

def using_groupby():
    groups = [grp for di, grp in df.groupby('D')]
    for itemp, jtemp in IT.product(groups, repeat=2):
        pass
In [28]: %timeit using_groupby()
10 loops, best of 3: 63.8 ms per loop
In [31]: %timeit using_orig()
1 loop, best of 3: 2min 22s per loop
Regarding the comment:
I can easily replace itemp and jtemp with a=1 or print "Hello" so ignore that
The answer above addresses how to compute itemp and jtemp more efficiently. If itemp and jtemp are not central to your real calculation, then we would need to better understand what you really want to compute in order to suggest (if possible) a way to compute it faster.
Here's a vectorized approach to form the groups based on the unique elements of the "D" column:
# Sort the dataframe based on the sorted indices of column 'D'
df_sorted = df.iloc[df['D'].argsort()]
# In the sorted dataframe's 'D' column, find the shift/cut indices
# (places where the value changes, indicating a change of group),
# then cut the dataframe at those indices into the final groups with np.split
cut_idx = np.where(np.diff(df_sorted['D'])>0)[0]+1
df_split = np.split(df_sorted,cut_idx)
Sample testing
1] Form a sample dataframe with random elements :
>>> df = pd.DataFrame(np.random.randint(0,100,size=(5, 4)), columns=list('ABCD'))
>>> df
A B C D
0 68 68 90 39
1 53 99 20 85
2 64 76 21 19
3 90 91 32 36
4 24 9 89 19
2] Run the original code and print the results :
>>> myunique = df['D'].unique()
>>> for i in myunique:
...     itemp = df[df['D'] == i]
...     print(itemp)
...
A B C D
0 68 68 90 39
A B C D
1 53 99 20 85
A B C D
2 64 76 21 19
4 24 9 89 19
A B C D
3 90 91 32 36
3] Run the proposed code and print the results :
>>> df_sorted = df.iloc[df['D'].argsort()]
>>> cut_idx = np.where(np.diff(df_sorted['D'])>0)[0]+1
>>> df_split = np.split(df_sorted,cut_idx)
>>> for split in df_split:
...     print(split)
...
A B C D
2 64 76 21 19
4 24 9 89 19
A B C D
3 90 91 32 36
A B C D
0 68 68 90 39
A B C D
1 53 99 20 85
