MultiIndex DataFrame from two DataFrames, joined by column headers - python

There are dozens of similar-sounding questions here; I think I've searched them all and could not find a solution to my problem:
I have two DataFrames. df_c:

        CAN-01  CAN-02  CAN-03
CE
ce1       0.84    0.73    0.50
ce2       0.06    0.13    0.05

And df_z:

        CAN-01  CAN-02  CAN-03
marker
cell1     0.29     1.5       7
cell2     1.00     3.0       1
For each 'marker' + 'CE' combination, I want to pair the values column by column.
Example: cell1 + ce1:
[[0.29, 0.84],[1.5,0.73],[7,0.5], ...]
(Continuing for cell1 + ce2, cell2 + ce1, cell2 + ce2)
I have a working example using two loops and .loc twice, but it takes forever on the full data set.
I think the best thing to build is a MultiIndex DataFrame with some merge/join/concat magic:

            CAN-01  CAN-02  CAN-03
  Source
0 CE          0.84    0.73    0.50
  Marker      0.29    1.5     7
1 CE           ...
  Marker        ...
Sample Code
import pandas as pd

dc = [['ce1', 0.84, 0.73, 0.5], ['ce2', 0.06, 0.13, 0.05]]
dat_c = pd.DataFrame(dc, columns=['CE', 'CAN-01', 'CAN-02', 'CAN-03'])
dat_c.set_index('CE', inplace=True)

dz = [['cell1', 0.29, 1.5, 7], ['cell2', 1, 3, 1]]
dat_z = pd.DataFrame(dz, columns=['marker', 'CAN-01', 'CAN-02', 'CAN-03'])
dat_z.set_index('marker', inplace=True)
Bad/Slow Solution
for ci, c_row in dat_c.iterrows():              # for each CE in the CE table
    for marker, z_row in dat_z.iterrows():      # for each marker in the marker table
        tmp = []
        for colz in dat_z.columns:
            if colz not in dat_c:
                continue
            entry_c = c_row.loc[colz]
            if len(entry_c.shape) > 0:          # skip non-scalar hits (duplicate labels)
                continue
            tmp.append([z_row.loc[colz], entry_c])

IIUC:
align the indexes, then concatenate and use groupby().agg(list) (pd.concat replaces the now-removed DataFrame.append):
dat_c.index = [f"cell{x+1}" for x in range(len(dat_c))]
df = pd.concat([dat_c, dat_z]).groupby(level=0).agg(list)
output of df:
CAN-01 CAN-02 CAN-03
cell1 [0.84, 0.29] [0.73, 1.5] [0.5, 7.0]
cell2 [0.06, 1.0] [0.13, 3.0] [0.05, 1.0]
If you need a plain nested list:
dat_c.index = [f"cell{x+1}" for x in range(len(dat_c))]
lst = pd.concat([dat_c, dat_z]).groupby(level=0).agg(list).to_numpy().tolist()
output of lst:
[[[0.84, 0.29], [0.73, 1.5], [0.5, 7.0]],
[[0.06, 1.0], [0.13, 3.0], [0.05, 1.0]]]
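Note that this trick pairs ce1 with cell1 and ce2 with cell2 via the shared index. If every marker + CE combination is needed, as the question asks, a minimal sketch building a dict keyed by (marker, CE) from the question's frames (names dat_c/dat_z as defined above):

```python
import pandas as pd

dat_c = pd.DataFrame([['ce1', 0.84, 0.73, 0.5], ['ce2', 0.06, 0.13, 0.05]],
                     columns=['CE', 'CAN-01', 'CAN-02', 'CAN-03']).set_index('CE')
dat_z = pd.DataFrame([['cell1', 0.29, 1.5, 7], ['cell2', 1, 3, 1]],
                     columns=['marker', 'CAN-01', 'CAN-02', 'CAN-03']).set_index('marker')

# columns present in both frames, in dat_z's order
cols = [c for c in dat_z.columns if c in dat_c.columns]

# one entry per (marker, CE) combination: [[marker value, CE value], ...]
pairs = {
    (m, c): [[dat_z.at[m, col], dat_c.at[c, col]] for col in cols]
    for m in dat_z.index
    for c in dat_c.index
}
```

This avoids the double iterrows loop entirely: each value is fetched with the scalar accessor .at, and the dict holds all four marker/CE combinations.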


How to calculate correlation between adjacent columns throughout a dataframe and add it to the dataframe?

I'm new to NLP and text analysis; I have a dataframe of tokens and their tf-idf scores from some text data I am working with. Ex.
input
df=
|article |token1|token2|token3|token4|token5|
|article1|.00 |.04 |.03 |.00 |.10 |
|article2|.07 |.00 |.14 |.04 |.00 |
The tokens are in alphabetical order; I'm trying to get the correlation between adjacent columns throughout the dataframe and append it to the dataframe. The output would look something like this:
desired output
df=
|article |token1 |token2 |token3 |token4 |token5 |
|article1|.00 |.04 |.03 |.00 |.10 |
|article2|.07 |.00 |.14 |.04 |.00 |
|Corr |Corr1-2|Corr2-3|Corr3-4|Corr4-5|Nan |
I know that I could use df.corr(), but that won't yield the expected output. I would think that looping over columns could get there, but I'm not really sure where to start. Does anyone have an idea on how to achieve this?
Use:
df2 = df.set_index('article')
df2.loc['Corr'] = df2.corrwith(df2.shift(-1, axis=1))
print(df2)
token1 token2 token3 token4 token5
article
article1 0.00 0.04 0.03 0.00 0.1
article2 0.07 0.00 0.14 0.04 0.0
Corr -1.00 -1.00 1.00 -1.00 NaN
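A self-contained version of the above, reproducing the two-row example (with only two rows, every adjacent-column correlation necessarily comes out as ±1):

```python
import pandas as pd

df = pd.DataFrame({
    'article': ['article1', 'article2'],
    'token1': [0.00, 0.07],
    'token2': [0.04, 0.00],
    'token3': [0.03, 0.14],
    'token4': [0.00, 0.04],
    'token5': [0.10, 0.00],
})

df2 = df.set_index('article')
# shift(-1, axis=1) slides token2 into token1's slot, token3 into token2's, ...
# so corrwith pairs each column with its right-hand neighbour; the last column
# has no neighbour and yields NaN
df2.loc['Corr'] = df2.corrwith(df2.shift(-1, axis=1))
```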
Random dataframe
df = pd.DataFrame({
    "article": ["article1", "article2", "article3", "article4"],
    "token1": [0.00, 0.03, 0.04, 0.00],
    "token2": [0.07, 0.00, 0.01, 0.05],
    "token3": [0.09, 0.08, 0.07, 0.06],
    "token4": [0.00, 0.03, 0.05, 0.08],
    "token5": [0.01, 0.04, 0.01, 0.02],
    "token6": [0.00, 0.02, 0.04, 0.06],
})
calculate corr for sub dataframe
for i in range(2, len(df.columns)):
    sub_df = df.iloc[:, [i-1, i]]
    print(sub_df.columns)
    print(sub_df.corr())
    print("\n")
sample result
Index(['token3', 'token4'], dtype='object')
token3 token4
token3 1.000000 -0.997054
token4 -0.997054 1.000000
Index(['token4', 'token5'], dtype='object')
token4 token5
token4 1.000000 0.070014
token5 0.070014 1.000000
Index(['token5', 'token6'], dtype='object')
token5 token6
token5 1.000000e+00 -9.897366e-17
token6 -9.897366e-17 1.000000e+00

Bar chart with ticks based on multiple dataframe columns

How can I make a bar chart in matplotlib (or pandas) from the bins in my dataframe?
I want something like this, below, where the x-axis labels come from the low, high in my dataframe (so first tick would read [-1.089, 0) and the y value is the percent column in my dataframe.
Here is an example dataset. The dataset is already in this format (I don't have an uncut version).
df = pd.DataFrame(
{
"low": [-1.089, 0, 0.3, 0.5, 0.6, 0.8],
"high": [0, 0.3, 0.5, 0.6, 0.8, 10.089],
"percent": [0.509, 0.11, 0.074, 0.038, 0.069, 0.202],
}
)
display(df)
Create a new column using the low and high cols.
Convert the values in the low and high columns to str type and build the new string in the [<low>, <high>) notation that you want.
From there, you can create a bar plot directly from df using df.plot.bar(), assigning the newly created column as x and percent as y.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.bar.html
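A minimal sketch of those steps, using plain string concatenation for the labels (the final plot call is commented out since it needs matplotlib):

```python
import pandas as pd

df = pd.DataFrame({
    "low": [-1.089, 0, 0.3, 0.5, 0.6, 0.8],
    "high": [0, 0.3, 0.5, 0.6, 0.8, 10.089],
    "percent": [0.509, 0.11, 0.074, 0.038, 0.069, 0.202],
})

# build "[<low>, <high>)" tick labels from the two bin columns
df["label"] = "[" + df["low"].astype(str) + ", " + df["high"].astype(str) + ")"

# then plot with the labels on the x-axis:
# ax = df.plot.bar(x="label", y="percent")
```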
Recreate the bins using IntervalArray.from_arrays (passing closed='left' so the labels match the [low, high) notation in the question):
df['label'] = pd.arrays.IntervalArray.from_arrays(df.low, df.high, closed='left')
#      low    high  percent          label
# 0 -1.089   0.000    0.509  [-1.089, 0.0)
# 1  0.000   0.300    0.110     [0.0, 0.3)
# 2  0.300   0.500    0.074     [0.3, 0.5)
# 3  0.500   0.600    0.038     [0.5, 0.6)
# 4  0.600   0.800    0.069     [0.6, 0.8)
# 5  0.800  10.089    0.202  [0.8, 10.089)
Then plot with x as these bins:
df.plot.bar(x='label', y='percent')

How can I get values from a dataframe/matrix into a list of tuples

I have a matrix that stores values like the table below:

             play_tv  play_series  Null  purchase  Conversion
Start           0.02         0.03  0.04      0.05        0.06
play_series     0.07         0.08  0.09      0.10        0.11
play_tv         0.12         0.13  0.14      0.15        0.16
Null            0.17         0.18  0.19      0.20        0.21
purchase        0.22         0.23  0.24      0.25        0.26
Conversion      0.27         0.28  0.29      0.30        0.31
and I have a dataframe like this:

session_id  path                                    path_pair
T01         [Start, play_series, Null]              [(Start, play_series), (play_series, Null)]
T02         [Start, play_tv, purchase, Conversion]  [(Start, play_tv), (play_tv, purchase), (purchase, Conversion)]
I want to look up values from the matrix to replace the path_pair column (or create a new column) in my current dataframe. The result should be a list of values. How can I do that?
[(Start, play_series), (play_series, Null)] -> [0.03, 0.09]
[(Start, play_tv), (play_tv, purchase), (purchase, Conversion)] -> [0.02, 0.15, 0.26]
The result I want:

session_id  path                                    path_pair
T01         [Start, play_series, Null]              [0.03, 0.09]
T02         [Start, play_tv, purchase, Conversion]  [0.02, 0.15, 0.26]
The script I tried for getting a single value from the matrix:
trans_matrix[trans_matrix.index=="Start"]["play_series"].values[0]
Given your input:
df1 = pd.DataFrame({'play_tv': [0.02, 0.07, 0.12, 0.17, 0.22, 0.27],
                    'play_series': [0.03, 0.08, 0.13, 0.18, 0.23, 0.28],
                    'Null': [0.04, 0.09, 0.14, 0.19, 0.24, 0.29],
                    'purchase': [0.05, 0.1, 0.15, 0.2, 0.25, 0.3],
                    'Conversion': [0.06, 0.11, 0.16, 0.21, 0.26, 0.31]},
                   index=['Start', 'play_series', 'play_tv', 'Null', 'purchase', 'Conversion'])
df2 = pd.DataFrame({'session_id': ['T01', 'T02'],
                    'path': [['Start', 'play_series', 'Null'],
                             ['Start', 'play_tv', 'purchase', 'Conversion']],
                    'path_pair': [[('Start', 'play_series'), ('play_series', 'Null')],
                                  [('Start', 'play_tv'), ('play_tv', 'purchase'), ('purchase', 'Conversion')]]})
You can update df2 by applying a function to column 'path_pair' that looks up values in df1:
df2['path_pair'] = df2['path_pair'].apply(lambda lst: [df1.loc[x,y] for (x,y) in lst])
Output:
session_id path path_pair
0 T01 [Start, play_series, Null] [0.03, 0.09]
1 T02 [Start, play_tv, purchase, Conversion] [0.02, 0.15, 0.26]
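For single-cell lookups, df1.at[row, col] is the scalar counterpart of .loc and slightly faster; a compact, self-contained check of the same lookup (df1 rebuilt as above, path pairs taken from the question):

```python
import pandas as pd

df1 = pd.DataFrame({'play_tv': [0.02, 0.07, 0.12, 0.17, 0.22, 0.27],
                    'play_series': [0.03, 0.08, 0.13, 0.18, 0.23, 0.28],
                    'Null': [0.04, 0.09, 0.14, 0.19, 0.24, 0.29],
                    'purchase': [0.05, 0.1, 0.15, 0.2, 0.25, 0.3],
                    'Conversion': [0.06, 0.11, 0.16, 0.21, 0.26, 0.31]},
                   index=['Start', 'play_series', 'play_tv', 'Null', 'purchase', 'Conversion'])

paths = [[('Start', 'play_series'), ('play_series', 'Null')],
         [('Start', 'play_tv'), ('play_tv', 'purchase'), ('purchase', 'Conversion')]]

# .at[row, col] fetches one scalar; one lookup per (from, to) pair
values = [[df1.at[x, y] for (x, y) in p] for p in paths]
```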

get means and SEM in one df with pandas groupby

I'd like to find an efficient way to use the df.groupby() function in pandas to return both the means and standard deviations of a data frame - preferably in one shot!
import pandas as pd

df = pd.DataFrame({'case': [1, 1, 2, 2, 3, 3],
                   'condition': [1, 2, 1, 2, 1, 2],
                   'var_a': [0.92, 0.88, 0.90, 0.79, 0.94, 0.85],
                   'var_b': [0.21, 0.15, 0.1, 0.16, 0.17, 0.23]})
with that data, I'd like an easier way (if there is one!) to perform the following:
grp_means = df.groupby('case', as_index=False).mean()
grp_sems = df.groupby('case', as_index=False).sem()
grp_means.rename(columns={'var_a': 'var_a_mean', 'var_b': 'var_b_mean'},
                 inplace=True)
grp_sems.rename(columns={'var_a': 'var_a_SEM', 'var_b': 'var_b_SEM'},
                inplace=True)
grouped = pd.concat([grp_means, grp_sems[['var_a_SEM', 'var_b_SEM']]], axis=1)
grouped
Out[1]:
   case  condition  var_a_mean  var_b_mean  var_a_SEM  var_b_SEM
0     1        1.5       0.900        0.18      0.020       0.03
1     2        1.5       0.845        0.13      0.055       0.03
2     3        1.5       0.895        0.20      0.045       0.03
I also recently learned of the .agg() function and tried df.groupby('grouper column').agg('var': 'mean', 'var': 'sem'), but this just returns a SyntaxError.
I think you need DataFrameGroupBy.agg, then drop the ('condition', 'sem') column and use map to flatten the MultiIndex columns:
df = df.groupby('case').agg(['mean','sem']).drop(('condition','sem'), axis=1)
df.columns = df.columns.map('_'.join)
df = df.reset_index()
print (df)
case condition_mean var_a_mean var_a_sem var_b_mean var_b_sem
0 1 1.5 0.900 0.020 0.18 0.03
1 2 1.5 0.845 0.055 0.13 0.03
2 3 1.5 0.895 0.045 0.20 0.03
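On pandas 0.25 or newer, named aggregation gives the flat column names in one call with no renaming step; a sketch on the same data (output column names chosen here for illustration):

```python
import pandas as pd

df = pd.DataFrame({'case': [1, 1, 2, 2, 3, 3],
                   'condition': [1, 2, 1, 2, 1, 2],
                   'var_a': [0.92, 0.88, 0.90, 0.79, 0.94, 0.85],
                   'var_b': [0.21, 0.15, 0.1, 0.16, 0.17, 0.23]})

# one output column per (input column, stat) pair, named up front
out = df.groupby('case', as_index=False).agg(
    condition_mean=('condition', 'mean'),
    var_a_mean=('var_a', 'mean'),
    var_a_sem=('var_a', 'sem'),
    var_b_mean=('var_b', 'mean'),
    var_b_sem=('var_b', 'sem'),
)
```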

Getting values out of a python dictionary

I am struggling with python dictionaries. I created a dictionary that looks like:
d = {'0.500': ['18.4 0.5', '17.9 0.4', '16.9 0.4', '18.6 0.4'],
'1.000': ['14.8 0.5', '14.9 0.5', '15.6 0.4', '15.9 0.3'],
'0.000': ['23.2 0.5', '23.2 0.8', '23.2 0.7', '23.2 0.1']}
and I would like to end up having:
0.500 17.95 0.425
which is the key, average of (18.4+17.9+16.9+18.6), average of (0.5+0.4+0.4+0.4)
(and the same for 1.000 and 0.000 with their corresponding averages)
Initially my dictionary had only two values, so I could rely on indexes:
for key in d:
    dvdl1 = d[key][0].split(" ")[0]
    dvdl2 = d[key][1].split(" ")[0]
    average = ((float(dvdl1)+float(dvdl2))/2)
but now I would like my code to work for dictionaries whose values have other lengths, let's say 4 (example above) or 5 or 6 entries each...
Cheers!
for k, v in d.items():
    col1, col2 = zip(*[list(map(float, x.split())) for x in v])
    print(k, sum(col1)/len(v), sum(col2)/len(v))

0.500 17.95 0.425
1.000 15.3 0.425
0.000 23.2 0.525
How this works:
>>> v = ['18.4 0.5', '17.9 0.4', '16.9 0.4', '18.6 0.4']
First split each item at whitespace and apply float to the pieces, so we get a list of lists:
>>> zipp = [list(map(float, x.split())) for x in v]
>>> zipp
[[18.4, 0.5], [17.9, 0.4], [16.9, 0.4], [18.6, 0.4]]  # list of rows
Now zip with * acts as un-zipping and gives us a list of columns:
>>> list(zip(*zipp))
[(18.4, 17.9, 16.9, 18.6), (0.5, 0.4, 0.4, 0.4)]
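The whole computation can also be collapsed into a dict comprehension with statistics.mean, keeping a result per key instead of printing (a sketch over the question's d):

```python
from statistics import mean

d = {'0.500': ['18.4 0.5', '17.9 0.4', '16.9 0.4', '18.6 0.4'],
     '1.000': ['14.8 0.5', '14.9 0.5', '15.6 0.4', '15.9 0.3'],
     '0.000': ['23.2 0.5', '23.2 0.8', '23.2 0.7', '23.2 0.1']}

# split each entry into floats, transpose with zip(*...), then average each column;
# this works for any number of entries per key and any number of columns per entry
averages = {k: [mean(col) for col in zip(*(map(float, s.split()) for s in v))]
            for k, v in d.items()}
```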
