There are dozens of similar-sounding questions here; I think I've searched them all and could not find a solution to my problem:
I have two DataFrames. df_c:
CAN-01 CAN-02 CAN-03
CE
ce1 0.84 0.73 0.50
ce2 0.06 0.13 0.05
And df_z:
CAN-01 CAN-02 CAN-03
marker
cell1 0.29 1.5 7
cell2 1.00 3.0 1
For each 'marker' + 'CE' combination, I want to pair the values over their shared column names.
Example: cell1 + ce1:
[[0.29, 0.84],[1.5,0.73],[7,0.5], ...]
(Continuing for cell1 + ce2, cell2 + ce1, cell2 + ce2)
I have a working example using two loops and .loc twice, but it takes forever on the full data set.
I think the best thing to build would be a MultiIndex DataFrame, with some merge/join/concat magic:
            CAN-01  CAN-02  CAN-03
  Source
0 CE          0.84    0.73    0.50
  Marker      0.29    1.5     7
1 CE           ...
  Marker       ...
Sample Code
dc = [['ce1', 0.84, 0.73, 0.5], ['ce2', 0.06, 0.13, 0.05]]
dat_c = pd.DataFrame(dc, columns=['CE', 'CAN-01', 'CAN-02', 'CAN-03'])
dat_c.set_index('CE',inplace=True)
dz = [['cell1', 0.29, 1.5, 7],['cell2', 1, 3, 1]]
dat_z = pd.DataFrame(dz, columns=['marker', "CAN-01", "CAN-02", "CAN-03"])
dat_z.set_index('marker',inplace=True)
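As an aside, the MultiIndex layout sketched above can be built straight from these two frames. A minimal sketch, assuming rows pair up by position (ce1 with cell1, ce2 with cell2):
# Drop the CE/marker labels, stack with a 'Source' key level, then swap
# levels so each positional pair groups its CE row with its Marker row.
stacked = pd.concat(
    [dat_c.reset_index(drop=True), dat_z.reset_index(drop=True)],
    keys=['CE', 'Marker'], names=['Source', None]
).swaplevel().sort_index()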
Bad/Slow Solution
for ci, c_row in dat_c.iterrows():              # for each CE in CE table
    for marker, z_row in dat_z.iterrows():      # for each marker in the marker table
        tmp = []
        for colz in dat_z.columns:
            if colz not in dat_c.columns:
                continue
            entry_c = c_row.loc[colz]
            if len(entry_c.shape) > 0:          # skip duplicated column labels
                continue
            tmp.append([z_row.loc[colz], entry_c])
IIUC, use pd.concat() + groupby() (DataFrame.append was removed in pandas 2.0):
dat_c.index = [f"cell{x+1}" for x in range(len(dat_c))]
df = pd.concat([dat_c, dat_z]).groupby(level=0).agg(list)
output of df:
CAN-01 CAN-02 CAN-03
cell1 [0.84, 0.29] [0.73, 1.5] [0.5, 7.0]
cell2 [0.06, 1.0] [0.13, 3.0] [0.05, 1.0]
If you need a plain nested list:
dat_c.index = [f"cell{x+1}" for x in range(len(dat_c))]
lst = pd.concat([dat_c, dat_z]).groupby(level=0).agg(list).to_numpy().tolist()
output of lst:
[[[0.84, 0.29], [0.73, 1.5], [0.5, 7.0]],
[[0.06, 1.0], [0.13, 3.0], [0.05, 1.0]]]
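If you actually need every marker + CE combination (the full cross product from the question), a hedged sketch over the original frames, before the index rename above; pairs and cols are illustrative names:
import numpy as np

cols = dat_z.columns.intersection(dat_c.columns)   # shared column names
pairs = {
    (m, c): np.column_stack([dat_z.loc[m, cols], dat_c.loc[c, cols]]).tolist()
    for m in dat_z.index
    for c in dat_c.index
}
# pairs[('cell1', 'ce1')] -> [[0.29, 0.84], [1.5, 0.73], [7.0, 0.5]]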
I'm new to NLP and text analysis; I have a dataframe of tokens and their tf-idf scores from some text data I am working with. Ex.
input
df=
|article |token1|token2|token3|token4|token5|
|article1|.00 |.04 |.03 |.00 |.10 |
|article2|.07 |.00 |.14 |.04 |.00 |
The tokens are in alphabetical order; I'm trying to get the correlation between adjacent columns throughout the dataframe and append it to the dataframe. The output would look something like this:
desired output
df=
|article |token1 |token2 |token3 |token4 |token5 |
|article1|.00 |.04 |.03 |.00 |.10 |
|article2|.07 |.00 |.14 |.04 |.00 |
|Corr |Corr1-2|Corr2-3|Corr3-4|Corr4-5|Nan |
I know that I could use df.corr(), but that won't yield the expected output. I would think that looping over columns could get there, but I'm not really sure where to start. Does anyone have an idea on how to achieve this?
Use corrwith on a copy of the frame shifted one column to the left; shift(-1, axis=1) aligns each column with its right-hand neighbour, so each pair's correlation lands in the pair's left column and the last column gets NaN:
df2 = df.set_index('article')
df2.loc['Corr'] = df2.corrwith(df2.shift(-1, axis=1))
print(df2)
token1 token2 token3 token4 token5
article
article1 0.00 0.04 0.03 0.00 0.1
article2 0.07 0.00 0.14 0.04 0.0
Corr -1.00 -1.00 1.00 -1.00 NaN
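As a sanity check, the same numbers come from correlating each adjacent column pair directly; a hedged sketch (checks is an illustrative name):
base = df2.drop('Corr')   # just the two article rows
checks = {f"{a}-{b}": base[a].corr(base[b])
          for a, b in zip(base.columns, base.columns[1:])}
print(checks)
# {'token1-token2': -1.0, 'token2-token3': -1.0, 'token3-token4': 1.0, 'token4-token5': -1.0}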
Random dataframe
df = pd.DataFrame({
"article": ["article1", "article2", "article3", "article4"],
"token1": [0.00, 0.03, 0.04, 0.00],
"token2": [0.07, 0.00, 0.01, 0.05],
"token3": [0.09, 0.08, 0.07, 0.06],
"token4": [0.00, 0.03, 0.05, 0.08],
"token5": [0.01, 0.04, 0.01, 0.02],
"token6": [0.00, 0.02, 0.04, 0.06],
})
Calculate corr for each adjacent pair of columns:
for i in range(2, len(df.columns)):
    sub_df = df.iloc[:, [i-1, i]]
    print(sub_df.columns)
    print(sub_df.corr())
    print("\n")
sample result (truncated)
Index(['token2', 'token3'], dtype='object')
          token2    token3
token2  1.000000  0.195366
token3  0.195366  1.000000
Index(['token3', 'token4'], dtype='object')
token3 token4
token3 1.000000 -0.997054
token4 -0.997054 1.000000
Index(['token4', 'token5'], dtype='object')
token4 token5
token4 1.000000 0.070014
token5 0.070014 1.000000
Index(['token5', 'token6'], dtype='object')
token5 token6
token5 1.000000e+00 -9.897366e-17
token6 -9.897366e-17 1.000000e+00
How can I make a bar chart in matplotlib (or pandas) from the bins in my dataframe?
I want something like the chart below, where the x-axis labels come from the low and high columns in my dataframe (so the first tick would read [-1.089, 0)) and the y values come from the percent column.
Here is an example dataset. The dataset is already in this format (I don't have an uncut version).
df = pd.DataFrame(
{
"low": [-1.089, 0, 0.3, 0.5, 0.6, 0.8],
"high": [0, 0.3, 0.5, 0.6, 0.8, 10.089],
"percent": [0.509, 0.11, 0.074, 0.038, 0.069, 0.202],
}
)
display(df)
Create a new column from the low and high columns: convert the numeric values to str and format them in the [<low>, <high>) notation that you want.
From there, you can create a bar plot directly from df using df.plot.bar(), assigning the newly created column as x and percent as y.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.bar.html
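A minimal sketch of those steps, assuming a new column named label (illustrative):
# Build "[low, high)" string labels, then bar-plot percent against them.
df["label"] = "[" + df["low"].astype(str) + ", " + df["high"].astype(str) + ")"
df.plot.bar(x="label", y="percent", rot=45, legend=False)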
Recreate the bins using IntervalArray.from_arrays:
df['label'] = pd.arrays.IntervalArray.from_arrays(df.low, df.high)
# low high percent label
# 0 -1.089 0.000 0.509 (-1.089, 0.0]
# 1 0.000 0.300 0.110 (0.0, 0.3]
# 2 0.300 0.500 0.074 (0.3, 0.5]
# 3 0.500 0.600 0.038 (0.5, 0.6]
# 4 0.600 0.800 0.069 (0.6, 0.8]
# 5 0.800 10.089 0.202 (0.8, 10.089]
Then plot with x as these bins:
df.plot.bar(x='label', y='percent')
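If you want left-closed labels like [-1.089, 0.0) to match the question's notation, from_arrays also accepts closed='left'.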
I have a matrix that stores values like the table below:
             play_tv  play_series  Null  purchase  Conversion
Start           0.02         0.03  0.04      0.05        0.06
play_series     0.07         0.08  0.09      0.10        0.11
play_tv         0.12         0.13  0.14      0.15        0.16
Null            0.17         0.18  0.19      0.20        0.21
purchase        0.22         0.23  0.24      0.25        0.26
Conversion      0.27         0.28  0.29      0.30        0.31
and I have a dataframe like this:
session_id  path                                    path_pair
T01         [Start, play_series, Null]              [(Start, play_series), (play_series, Null)]
T02         [Start, play_tv, purchase, Conversion]  [(Start, play_tv), (play_tv, purchase), (purchase, Conversion)]
I want to look up each pair in the matrix and use the result to replace the path_pair column, or to create a new column in my current dataframe. The result should be a list of values. How can I do that?
[(Start, play_series), (play_series, Null)] -> [0.03, 0.09]
[(Start, play_tv), (play_tv, purchase), (purchase, conversion)] -> [0.02, 0.15, 0.26 ]
result I want:
session_id  path                                    path_pair
T01         [Start, play_series, Null]              [0.03, 0.09]
T02         [Start, play_tv, purchase, Conversion]  [0.02, 0.15, 0.26]
The script I tried for getting a single value from the matrix:
trans_matrix[trans_matrix.index=="Start"]["play_series"].values[0]
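(For a single lookup like this, trans_matrix.loc['Start', 'play_series'] is the simpler equivalent.)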
Given your input:
df1 = pd.DataFrame({'play_tv': [0.02, 0.07, 0.12, 0.17, 0.22, 0.27],
'play_series': [0.03, 0.08, 0.13, 0.18, 0.23, 0.28],
'Null': [0.04, 0.09, 0.14, 0.19, 0.24, 0.29],
'purchase': [0.05, 0.1, 0.15, 0.2, 0.25, 0.3],
'Conversion': [0.06, 0.11, 0.16, 0.21, 0.26, 0.31]},
index=['Start','play_series','play_tv','Null','purchase','Conversion'])
df2 = pd.DataFrame({'session_id': ['T01', 'T02'],
'path': [['Start', 'play_series', 'Null'],
['Start', 'play_tv', 'purchase', 'Conversion']],
'path_pair': [[('Start', 'play_series'), ('play_series', 'Null')],
[('Start', 'play_tv'), ('play_tv', 'purchase'), ('purchase', 'Conversion')]]})
You can update df2 by applying a function to column 'path_pair' that looks up values in df1:
df2['path_pair'] = df2['path_pair'].apply(lambda lst: [df1.loc[x,y] for (x,y) in lst])
Output:
session_id path path_pair
0 T01 [Start, play_series, Null] [0.03, 0.09]
1 T02 [Start, play_tv, purchase, Conversion] [0.02, 0.15, 0.26]
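If you would rather keep path_pair intact and add a new column, a hedged variant to run instead of the replacement above; path_value is an illustrative name, and DataFrame.at does fast scalar lookups:
df2['path_value'] = df2['path_pair'].apply(lambda lst: [df1.at[x, y] for (x, y) in lst])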
I'd like to find an efficient way to use the df.groupby() function in pandas to return both the means and standard deviations of a data frame - preferably in one shot!
import pandas as pd
df = pd.DataFrame({'case':[1, 1, 2, 2, 3, 3],
'condition':[1,2,1,2,1,2],
'var_a':[0.92, 0.88, 0.90, 0.79, 0.94, 0.85],
'var_b':[0.21, 0.15, 0.1, 0.16, 0.17, 0.23]})
with that data, I'd like an easier way (if there is one!) to perform the following:
grp_means = df.groupby('case', as_index=False).mean()
grp_sems = df.groupby('case', as_index=False).sem()
grp_means.rename(columns={'var_a':'var_a_mean', 'var_b':'var_b_mean'},
inplace=True)
grp_sems.rename(columns={'var_a':'var_a_SEM', 'var_b':'var_b_SEM'},
inplace=True)
grouped = pd.concat([grp_means, grp_sems[['var_a_SEM', 'var_b_SEM']]], axis=1)
grouped
Out[1]:
   case  condition  var_a_mean  var_b_mean  var_a_SEM  var_b_SEM
0     1        1.5       0.900        0.18      0.020       0.03
1     2        1.5       0.845        0.13      0.055       0.03
2     3        1.5       0.895        0.20      0.045       0.03
I also recently learned of the .agg() function and tried df.groupby('grouper column').agg('var': 'mean', 'var': 'sem'), but this just returns a SyntaxError.
I think you need DataFrameGroupBy.agg, then drop the ('condition', 'sem') column and use map to flatten the MultiIndex columns:
df = df.groupby('case').agg(['mean','sem']).drop(('condition','sem'), axis=1)
df.columns = df.columns.map('_'.join)
df = df.reset_index()
print (df)
case condition_mean var_a_mean var_a_sem var_b_mean var_b_sem
0 1 1.5 0.900 0.020 0.18 0.03
1 2 1.5 0.845 0.055 0.13 0.03
2 3 1.5 0.895 0.045 0.20 0.03
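A hedged alternative, starting again from the original df, is named aggregation (pandas 0.25+), which avoids the MultiIndex flattening step; the output column names are illustrative:
grouped = df.groupby('case', as_index=False).agg(
    condition_mean=('condition', 'mean'),
    var_a_mean=('var_a', 'mean'),
    var_a_sem=('var_a', 'sem'),
    var_b_mean=('var_b', 'mean'),
    var_b_sem=('var_b', 'sem'),
)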
I am struggling with python dictionaries. I created a dictionary that looks like:
d = {'0.500': ['18.4 0.5', '17.9 0.4', '16.9 0.4', '18.6 0.4'],
'1.000': ['14.8 0.5', '14.9 0.5', '15.6 0.4', '15.9 0.3'],
'0.000': ['23.2 0.5', '23.2 0.8', '23.2 0.7', '23.2 0.1']}
and I would like to end up having:
0.500 17.95 0.425
which is the key, average of (18.4+17.9+16.9+18.6), average of (0.5+0.4+0.4+0.4)
(and the same for 1.000 and 0.000 with their corresponding averages)
Initially my dictionary had only two values, so I could rely on indexes:
for key in d:
    dvdl1 = d[key][0].split(" ")[0]
    dvdl2 = d[key][1].split(" ")[0]
    average = (float(dvdl1) + float(dvdl2)) / 2
but now I would like my code to work for dictionaries with varying numbers of values, let's say 4 (the example above) or 5 or 6 values each...
Cheers!
for k, v in d.items():
    col1, col2 = zip(*[[float(n) for n in x.split()] for x in v])
    print(k, sum(col1) / len(v), sum(col2) / len(v))
0.500 17.95 0.425
1.000 15.3 0.425
0.000 23.2 0.525
How this works:
>>> v = ['18.4 0.5', '17.9 0.4', '16.9 0.4', '18.6 0.4']
First split each item on whitespace and convert the parts to float, giving a list of rows:
>>> zipp = [[float(n) for n in x.split()] for x in v]
>>> zipp
[[18.4, 0.5], [17.9, 0.4], [16.9, 0.4], [18.6, 0.4]] #list of rows
Now we can use zip with *, which acts as unzipping, to get a list of columns:
>>> list(zip(*zipp))
[(18.4, 17.9, 16.9, 18.6), (0.5, 0.4, 0.4, 0.4)]
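From there, averaging each column tuple completes the calculation (results shown up to float rounding):
>>> [sum(col) / len(col) for col in zip(*zipp)]
[17.95, 0.425]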