DataFrame Create new column after applying a function on groupby values

DataFrame Create new column after applying a function on groupby values - python

I have such a dataframe :
With a minimal example :
d = {'Subject': [1,1,1,1,2,2,3,3,3,3,3,3,3],
'Pattern': [1,1,2,2,3,3,2,2,2,2,2,2,2],
'Time': [0.85, 0.92, 1.03, 1.06, 0.89, 0.85, 1.20, 1.03, 1.25, 100.03, 1.97,0.23,0.64]}
df = pd.DataFrame(data=d)
Where Subject ranges from 1 to 8 and Pattern from 1 to 3. I want to create a new column where after grouping by Subject and Pattern I apply a function that removes outliers from the Time list associated to the groupby. Right now I have a solution that works well, but I was wondering if there would be a more elegant solution to it, so that I learn how to interact better with DataFrame. Taking the example, it should output :
Subject Pattern Time Time_2
0 1 1 0.85 0.85
1 1 1 0.92 0.92
2 1 2 1.03 1.03
3 1 2 1.06 1.06
4 2 3 0.89 0.89
5 2 3 0.85 0.85
6 3 2 1.20 1.20
7 3 2 1.03 1.03
8 3 2 1.25 1.25
9 3 2 100.03 0.00 # <---
10 3 2 1.97 1.97
11 3 2 0.23 0.23
12 3 2 0.64 0.64
My current code :
def remove_outliers(arr):
elements = np.array(arr)
mean = np.mean(elements)
sd = np.std(elements)
return [x if (mean - 2 * sd < x < mean + 2 * sd) else 0 for x in arr]
df_g = df.groupby(['Subject', 'Pattern'])['Time']
times = []
keys = list(df_g.groups.keys())
for i, l in enumerate(df_g.apply(list)):
times.append((keys[i], remove_outliers(l)))
df['Time_2'] = 0
for k, l in times:
vals = df[(df['Subject'] == k[0]) & (df['Pattern'] == k[1])].index.values
df['Time_2'].iloc[vals] = l

Try this -
Use groupby transform the groups to get GroupWise mean and std for each row.
Next use these series objects to create your check condition as per your function.
Next inverse this and use df.mask to mask values that lie outside this range, and fill them with 0 instead.
grouper = df.groupby(['Subject', 'Pattern'])['Time']
mean = grouper.transform('mean')
std = grouper.transform('std').fillna(0)
check = (df['Time'] < (mean - 2*std)) | (df['Time'] > (mean + 2*std))
df['Time_new'] = df['Time'].mask(check).fillna(0)
print(df)
Subject Pattern Time Time_new
0 1 1 0.85 0.85
1 1 1 0.92 0.92
2 1 2 1.03 1.03
3 1 2 1.06 1.06
4 2 3 0.89 0.89
5 2 3 0.85 0.85
6 3 2 1.20 1.20
7 3 2 1.03 1.03
8 3 2 1.25 1.25
9 3 2 100.03 0.00 #<---
10 3 2 1.97 1.97
11 3 2 0.23 0.23
12 3 2 0.64 0.64
NOTE: Jsut to add the 3std deviation condition is too high a range for your example. Try 2std.

Related

Map numeric data into bins in Pandas dataframe for seperate groups using dictionaries

I have a pandas dataframe as follows:
df = pd.DataFrame(data = [[1,0.56],[1,0.59],[1,0.62],[1,0.83],[2,0.85],[2,0.01],[2,0.79],[3,0.37],[3,0.99],[3,0.48],[3,0.55],[3,0.06]],columns=['polyid','value'])
polyid value
0 1 0.56
1 1 0.59
2 1 0.62
3 1 0.83
4 2 0.85
5 2 0.01
6 2 0.79
7 3 0.37
8 3 0.99
9 3 0.48
10 3 0.55
11 3 0.06
I need to reclassify the 'value' column separately for each 'polyid'. For the reclassification, I have two dictionaries. One with the bins that contain the information on how I want to cut the 'values' for each 'polyid' separately:
bins_dic = {1:[0,0.6,0.8,1], 2:[0,0.2,0.9,1], 3:[0,0.5,0.6,1]}
And one with the ids with which I want to label the resulting bins:
ids_dic = {1:[1,2,3], 2:[1,2,3], 3:[1,2,3]}
I tried to get this answer to work for my use case. I could only come up with applying pd.cut on each 'polyid' subset and then pd.concat all subsets again back to one dataframe:
import pandas as pd
def reclass_df_dic(df, bins_dic, names_dic, bin_key_col, val_col, name_col):
df_lst = []
for key in df[bin_key_col].unique():
bins = bins_dic[key]
names = names_dic[key]
sub_df = df[df[bin_key_col] == key]
sub_df[name_col] = pd.cut(df[val_col], bins, labels=names)
df_lst.append(sub_df)
return(pd.concat(df_lst))
df = pd.DataFrame(data = [[1,0.56],[1,0.59],[1,0.62],[1,0.83],[2,0.85],[2,0.01],[2,0.79],[3,0.37],[3,0.99],[3,0.48],[3,0.55],[3,0.06]],columns=['polyid','value'])
bins_dic = {1:[0,0.6,0.8,1], 2:[0,0.2,0.9,1], 3:[0,0.5,0.6,1]}
ids_dic = {1:[1,2,3], 2:[1,2,3], 3:[1,2,3]}
df = reclass_df_dic(df, bins_dic, ids_dic, 'polyid', 'value', 'id')
This results in my desired output:
polyid value id
0 1 0.56 1
1 1 0.59 1
2 1 0.62 2
3 1 0.83 3
4 2 0.85 2
5 2 0.01 1
6 2 0.79 2
7 3 0.37 1
8 3 0.99 3
9 3 0.48 1
10 3 0.55 2
11 3 0.06 1
However, the line:
sub_df[name_col] = pd.cut(df[val_col], bins, labels=names)
raises the warning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
that I am unable to solve with using .loc. Also, I guess there generally is a more efficient way of doing this without having to loop over each category?

A simpler solution would be to use groupby and apply a custom function on each group. In this case, we can define a function reclass that obtains the correct bins and ids and then uses pd.cut:
def reclass(group, name):
bins = bins_dic[name]
ids = ids_dic[name]
return pd.cut(group, bins, labels=ids)
df['id'] = df.groupby('polyid')['value'].apply(lambda x: reclass(x, x.name))
Result:
polyid value id
0 1 0.56 1
1 1 0.59 1
2 1 0.62 2
3 1 0.83 3
4 2 0.85 2
5 2 0.01 1
6 2 0.79 2
7 3 0.37 1
8 3 0.99 3
9 3 0.48 1
10 3 0.55 2
11 3 0.06 1

Generate new column based on values in another column and their index

In the df underneath, I want to sort the values of column 'cdf_X' based on column 'A' and 'X'. Column 'X' and 'cdf_X' are connected, so if a value in 'X' appears in column 'A', the value of 'cdf_X' should be repositioned to that index number of column 'A' in a new column. (Values don't occur twice in a column 'cdf_A'.)
Example: 'X'=3 at index 0 -> cdf_X=0.05 at index 0 -> '3' appears in column 'A' at index 4 -> cdf_A at index 4 = cdf_X at index 0
Initial df:
A X cdf_X
0 7 3 0.05
1 4 4 0.15
2 11 7 0.27
3 9 9 0.45
4 3 11 0.69
5 13 13 1.00
Desired df:
A X cdf_X cdf_A
0 7 3 0.05 0.27
1 4 4 0.15 0.15
2 11 7 0.27 0.69
3 9 9 0.45 0.45
4 3 11 0.69 0.05
5 13 13 1.00 1.00
Tried code:
import pandas as pd
df = pd.DataFrame({"A": [7,4,11,9,3,13], "cdf_X": [0.05,0.15,0.27,0.45,0.69,1.00], "X": [3,4,7,9,11,13]})
df.loc[:, 'cdf_A'] = df['cdf_X'].where(df['A'] == df['X'])
 print(df)

Check with map
df['cdf_A'] = df.A.map(df.set_index('X')['cdf'])

I think you need replace
df['cdf_A'] = df.A.replace(df.set_index('X').cdf)
Out[989]:
A X cdf cdf_A
0 7 3 0.05 0.27
1 4 4 0.15 0.15
2 11 7 0.27 0.69
3 9 9 0.45 0.45
4 3 11 0.69 0.05
5 13 13 1.00 1.00

Add a column of normalised values based on sections of a dataframe column

I have got a dataframe of several hundred thousand rows. Which is of the following format:
time_elapsed cycle
0 0.00 1
1 0.50 1
2 1.00 1
3 1.30 1
4 1.50 1
5 0.00 2
6 0.75 2
7 1.50 2
8 3.00 2
I want to create a third column that will give me the percentage of each time instance that the row is of the cycle (until the next time_elapsed = 0). To give something like:
time_elapsed cycle percentage
0 0.00 1 0
1 0.50 1 33
2 1.00 1 75
3 1.30 1 87
4 1.50 1 100
5 0.00 2 0
6 0.75 2 25
7 1.50 2 50
8 3.00 2 100
I'm not fussed about the number of decimal places, I've just excluded them for ease here.
I started going along this route, but I keep getting errors.
data['percentage'] = data['time_elapsed'].sub(data.groupby(['cycle'])['time_elapsed'].transform(lambda x: x*100/data['time_elapsed'].max()))
I think it's the lambda function causing errors, but I'm not sure what I should do to change it. Any help is much appreciated :)

Use Series.div for division instead sub for subtract, then solution is simplify - get only max per groups, multiple by Series.mul, if necessary Series.round and last convert to integers by Series.astype:
s = data.groupby(['cycle'])['time_elapsed'].transform('max')
data['percentage'] = data['time_elapsed'].div(s).mul(100).round().astype(int)
print (data)
time_elapsed cycle percentage
0 0.00 1 0
1 0.50 1 33
2 1.00 1 67
3 1.30 1 87
4 1.50 1 100
5 0.00 2 0
6 0.75 2 25
7 1.50 2 50
8 3.00 2 100

Concat two DataFrames on missing indices

I have two DataFrames and want to use the second one only on the rows whose index is not already contained in the first one.
What is the most efficient way to do this?
Example:
df_1
idx val
0 0.32
1 0.54
4 0.26
5 0.76
7 0.23
df_2
idx val
1 10.24
2 10.90
3 10.66
4 10.25
6 10.13
7 10.52
df_final
idx val
0 0.32
1 0.54
2 10.90
3 10.66
4 0.26
5 0.76
6 10.13
7 0.23
Recap: I need to add the rows in df_2 for which the index is not already in df_1.
EDIT
Removed some indices in df_2 to illustrate the fact that all indices from df_1 are not covered in df_2.

You can use reindex with combine_first or fillna:
df = df_1.reindex(df_2.index).combine_first(df_2)
print (df)
val
idx
0 0.32
1 0.54
2 10.90
3 10.66
4 0.26
5 0.76
6 10.13
7 0.23
df = df_1.reindex(df_2.index).fillna(df_2)
print (df)
val
idx
0 0.32
1 0.54
2 10.90
3 10.66
4 0.26
5 0.76
6 10.13
7 0.23

You can achieve the wanted output by using the combine_first method of the DataFrame. From the documentation of the method:
Combine two DataFrame objects and default to non-null values in frame calling the method. Result index columns will be the union of the respective indexes and columns
Example usage:
import pandas as pd
df_1 = pd.DataFrame([0.32,0.54,0.26,0.76,0.23], columns=['val'], index=[0,1,4,5,7])
df_1.index.name = 'idx'
df_2 = pd.DataFrame([10.56,10.24,10.90,10.66,10.25,10.13,10.52], columns=['val'], index=[0,1,2,3,4,6,7])
df_2.index.name = 'idx'
df_final = df_1.combine_first(df_2)
This will give the desired result:
In [7]: df_final
Out[7]:
val
idx
0 0.32
1 0.54
2 10.90
3 10.66
4 0.26
5 0.76
6 10.13
7 0.23

Pandas column of lists, create a row for each list element

I have a dataframe where some cells contain lists of multiple values. Rather than storing multiple
values in a cell, I'd like to expand the dataframe so that each item in the list gets its own row (with the same values in all other columns). So if I have:
import pandas as pd
import numpy as np
df = pd.DataFrame(
{'trial_num': [1, 2, 3, 1, 2, 3],
'subject': [1, 1, 1, 2, 2, 2],
'samples': [list(np.random.randn(3).round(2)) for i in range(6)]
}
)
df
Out[10]:
samples subject trial_num
0 [0.57, -0.83, 1.44] 1 1
1 [-0.01, 1.13, 0.36] 1 2
2 [1.18, -1.46, -0.94] 1 3
3 [-0.08, -4.22, -2.05] 2 1
4 [0.72, 0.79, 0.53] 2 2
5 [0.4, -0.32, -0.13] 2 3
How do I convert to long form, e.g.:
subject trial_num sample sample_num
0 1 1 0.57 0
1 1 1 -0.83 1
2 1 1 1.44 2
3 1 2 -0.01 0
4 1 2 1.13 1
5 1 2 0.36 2
6 1 3 1.18 0
# etc.
The index is not important, it's OK to set existing
columns as the index and the final ordering isn't
important.

Pandas >= 0.25
Series and DataFrame methods define a .explode() method that explodes lists into separate rows. See the docs section on Exploding a list-like column.
df = pd.DataFrame({
'var1': [['a', 'b', 'c'], ['d', 'e',], [], np.nan],
'var2': [1, 2, 3, 4]
})
df
var1 var2
0 [a, b, c] 1
1 [d, e] 2
2 [] 3
3 NaN 4
df.explode('var1')
var1 var2
0 a 1
0 b 1
0 c 1
1 d 2
1 e 2
2 NaN 3 # empty list converted to NaN
3 NaN 4 # NaN entry preserved as-is
# to reset the index to be monotonically increasing...
df.explode('var1').reset_index(drop=True)
var1 var2
0 a 1
1 b 1
2 c 1
3 d 2
4 e 2
5 NaN 3
6 NaN 4
Note that this also handles mixed columns of lists and scalars, as well as empty lists and NaNs appropriately (this is a drawback of repeat-based solutions).
However, you should note that explode only works on a single column (for now).
P.S.: if you are looking to explode a column of strings, you need to split on a separator first, then use explode. See this (very much) related answer by me.

A bit longer than I expected:
>>> df
samples subject trial_num
0 [-0.07, -2.9, -2.44] 1 1
1 [-1.52, -0.35, 0.1] 1 2
2 [-0.17, 0.57, -0.65] 1 3
3 [-0.82, -1.06, 0.47] 2 1
4 [0.79, 1.35, -0.09] 2 2
5 [1.17, 1.14, -1.79] 2 3
>>>
>>> s = df.apply(lambda x: pd.Series(x['samples']),axis=1).stack().reset_index(level=1, drop=True)
>>> s.name = 'sample'
>>>
>>> df.drop('samples', axis=1).join(s)
subject trial_num sample
0 1 1 -0.07
0 1 1 -2.90
0 1 1 -2.44
1 1 2 -1.52
1 1 2 -0.35
1 1 2 0.10
2 1 3 -0.17
2 1 3 0.57
2 1 3 -0.65
3 2 1 -0.82
3 2 1 -1.06
3 2 1 0.47
4 2 2 0.79
4 2 2 1.35
4 2 2 -0.09
5 2 3 1.17
5 2 3 1.14
5 2 3 -1.79
If you want sequential index, you can apply reset_index(drop=True) to the result.
update:
>>> res = df.set_index(['subject', 'trial_num'])['samples'].apply(pd.Series).stack()
>>> res = res.reset_index()
>>> res.columns = ['subject','trial_num','sample_num','sample']
>>> res
subject trial_num sample_num sample
0 1 1 0 1.89
1 1 1 1 -2.92
2 1 1 2 0.34
3 1 2 0 0.85
4 1 2 1 0.24
5 1 2 2 0.72
6 1 3 0 -0.96
7 1 3 1 -2.72
8 1 3 2 -0.11
9 2 1 0 -1.33
10 2 1 1 3.13
11 2 1 2 -0.65
12 2 2 0 0.10
13 2 2 1 0.65
14 2 2 2 0.15
15 2 3 0 0.64
16 2 3 1 -0.10
17 2 3 2 -0.76

UPDATE: the solution below was helpful for older Pandas versions, because the DataFrame.explode() wasn’t available. Starting from Pandas 0.25.0 you can simply use DataFrame.explode().
lst_col = 'samples'
r = pd.DataFrame({
col:np.repeat(df[col].values, df[lst_col].str.len())
for col in df.columns.drop(lst_col)}
).assign(**{lst_col:np.concatenate(df[lst_col].values)})[df.columns]
Result:
In [103]: r
Out[103]:
samples subject trial_num
0 0.10 1 1
1 -0.20 1 1
2 0.05 1 1
3 0.25 1 2
4 1.32 1 2
5 -0.17 1 2
6 0.64 1 3
7 -0.22 1 3
8 -0.71 1 3
9 -0.03 2 1
10 -0.65 2 1
11 0.76 2 1
12 1.77 2 2
13 0.89 2 2
14 0.65 2 2
15 -0.98 2 3
16 0.65 2 3
17 -0.30 2 3
PS here you may find a bit more generic solution
UPDATE: some explanations: IMO the easiest way to understand this code is to try to execute it step-by-step:
in the following line we are repeating values in one column N times where N - is the length of the corresponding list:
In [10]: np.repeat(df['trial_num'].values, df[lst_col].str.len())
Out[10]: array([1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1, 1, 2, 2, 2, 3, 3, 3], dtype=int64)
this can be generalized for all columns, containing scalar values:
In [11]: pd.DataFrame({
...: col:np.repeat(df[col].values, df[lst_col].str.len())
...: for col in df.columns.drop(lst_col)}
...: )
Out[11]:
trial_num subject
0 1 1
1 1 1
2 1 1
3 2 1
4 2 1
5 2 1
6 3 1
.. ... ...
11 1 2
12 2 2
13 2 2
14 2 2
15 3 2
16 3 2
17 3 2
[18 rows x 2 columns]
using np.concatenate() we can flatten all values in the list column (samples) and get a 1D vector:
In [12]: np.concatenate(df[lst_col].values)
Out[12]: array([-1.04, -0.58, -1.32, 0.82, -0.59, -0.34, 0.25, 2.09, 0.12, 0.83, -0.88, 0.68, 0.55, -0.56, 0.65, -0.04, 0.36, -0.31])
putting all this together:
In [13]: pd.DataFrame({
...: col:np.repeat(df[col].values, df[lst_col].str.len())
...: for col in df.columns.drop(lst_col)}
...: ).assign(**{lst_col:np.concatenate(df[lst_col].values)})
Out[13]:
trial_num subject samples
0 1 1 -1.04
1 1 1 -0.58
2 1 1 -1.32
3 2 1 0.82
4 2 1 -0.59
5 2 1 -0.34
6 3 1 0.25
.. ... ... ...
11 1 2 0.68
12 2 2 0.55
13 2 2 -0.56
14 2 2 0.65
15 3 2 -0.04
16 3 2 0.36
17 3 2 -0.31
[18 rows x 3 columns]
using pd.DataFrame()[df.columns] will guarantee that we are selecting columns in the original order...

you can also use pd.concat and pd.melt for this:
>>> objs = [df, pd.DataFrame(df['samples'].tolist())]
>>> pd.concat(objs, axis=1).drop('samples', axis=1)
subject trial_num 0 1 2
0 1 1 -0.49 -1.00 0.44
1 1 2 -0.28 1.48 2.01
2 1 3 -0.52 -1.84 0.02
3 2 1 1.23 -1.36 -1.06
4 2 2 0.54 0.18 0.51
5 2 3 -2.18 -0.13 -1.35
>>> pd.melt(_, var_name='sample_num', value_name='sample',
... value_vars=[0, 1, 2], id_vars=['subject', 'trial_num'])
subject trial_num sample_num sample
0 1 1 0 -0.49
1 1 2 0 -0.28
2 1 3 0 -0.52
3 2 1 0 1.23
4 2 2 0 0.54
5 2 3 0 -2.18
6 1 1 1 -1.00
7 1 2 1 1.48
8 1 3 1 -1.84
9 2 1 1 -1.36
10 2 2 1 0.18
11 2 3 1 -0.13
12 1 1 2 0.44
13 1 2 2 2.01
14 1 3 2 0.02
15 2 1 2 -1.06
16 2 2 2 0.51
17 2 3 2 -1.35
last, if you need you can sort base on the first the first three columns.

Trying to work through Roman Pekar's solution step-by-step to understand it better, I came up with my own solution, which uses melt to avoid some of the confusing stacking and index resetting. I can't say that it's obviously a clearer solution though:
items_as_cols = df.apply(lambda x: pd.Series(x['samples']), axis=1)
# Keep original df index as a column so it's retained after melt
items_as_cols['orig_index'] = items_as_cols.index
melted_items = pd.melt(items_as_cols, id_vars='orig_index',
var_name='sample_num', value_name='sample')
melted_items.set_index('orig_index', inplace=True)
df.merge(melted_items, left_index=True, right_index=True)
Output (obviously we can drop the original samples column now):
samples subject trial_num sample_num sample
0 [1.84, 1.05, -0.66] 1 1 0 1.84
0 [1.84, 1.05, -0.66] 1 1 1 1.05
0 [1.84, 1.05, -0.66] 1 1 2 -0.66
1 [-0.24, -0.9, 0.65] 1 2 0 -0.24
1 [-0.24, -0.9, 0.65] 1 2 1 -0.90
1 [-0.24, -0.9, 0.65] 1 2 2 0.65
2 [1.15, -0.87, -1.1] 1 3 0 1.15
2 [1.15, -0.87, -1.1] 1 3 1 -0.87
2 [1.15, -0.87, -1.1] 1 3 2 -1.10
3 [-0.8, -0.62, -0.68] 2 1 0 -0.80
3 [-0.8, -0.62, -0.68] 2 1 1 -0.62
3 [-0.8, -0.62, -0.68] 2 1 2 -0.68
4 [0.91, -0.47, 1.43] 2 2 0 0.91
4 [0.91, -0.47, 1.43] 2 2 1 -0.47
4 [0.91, -0.47, 1.43] 2 2 2 1.43
5 [-1.14, -0.24, -0.91] 2 3 0 -1.14
5 [-1.14, -0.24, -0.91] 2 3 1 -0.24
5 [-1.14, -0.24, -0.91] 2 3 2 -0.91

For those looking for a version of Roman Pekar's answer that avoids manual column naming:
column_to_explode = 'samples'
res = (df
.set_index([x for x in df.columns if x != column_to_explode])[column_to_explode]
.apply(pd.Series)
.stack()
.reset_index())
res = res.rename(columns={
res.columns[-2]:'exploded_{}_index'.format(column_to_explode),
res.columns[-1]: '{}_exploded'.format(column_to_explode)})

I found the easiest way was to:
Convert the samples column into a DataFrame
Joining with the original df
Melting
Shown here:
df.samples.apply(lambda x: pd.Series(x)).join(df).\
melt(['subject','trial_num'],[0,1,2],var_name='sample')
subject trial_num sample value
0 1 1 0 -0.24
1 1 2 0 0.14
2 1 3 0 -0.67
3 2 1 0 -1.52
4 2 2 0 -0.00
5 2 3 0 -1.73
6 1 1 1 -0.70
7 1 2 1 -0.70
8 1 3 1 -0.29
9 2 1 1 -0.70
10 2 2 1 -0.72
11 2 3 1 1.30
12 1 1 2 -0.55
13 1 2 2 0.10
14 1 3 2 -0.44
15 2 1 2 0.13
16 2 2 2 -1.44
17 2 3 2 0.73
It's worth noting that this may have only worked because each trial has the same number of samples (3). Something more clever may be necessary for trials of different sample sizes.

import pandas as pd
df = pd.DataFrame([{'Product': 'Coke', 'Prices': [100,123,101,105,99,94,98]},{'Product': 'Pepsi', 'Prices': [101,104,104,101,99,99,99]}])
print(df)
df = df.assign(Prices=df.Prices.str.split(',')).explode('Prices')
print(df)
Try this in pandas >=0.25 version

Very late answer but I want to add this:
A fast solution using vanilla Python that also takes care of the sample_num column in OP's example. On my own large dataset with over 10 million rows and a result with 28 million rows this only takes about 38 seconds. The accepted solution completely breaks down with that amount of data and leads to a memory error on my system that has 128GB of RAM.
df = df.reset_index(drop=True)
lstcol = df.lstcol.values
lstcollist = []
indexlist = []
countlist = []
for ii in range(len(lstcol)):
lstcollist.extend(lstcol[ii])
indexlist.extend([ii]*len(lstcol[ii]))
countlist.extend([jj for jj in range(len(lstcol[ii]))])
df = pd.merge(df.drop("lstcol",axis=1),pd.DataFrame({"lstcol":lstcollist,"lstcol_num":countlist},
index=indexlist),left_index=True,right_index=True).reset_index(drop=True)

Also very late, but here is an answer from Karvy1 that worked well for me if you don't have pandas >=0.25 version: https://stackoverflow.com/a/52511166/10740287
For the example above you may write:
data = [(row.subject, row.trial_num, sample) for row in df.itertuples() for sample in row.samples]
data = pd.DataFrame(data, columns=['subject', 'trial_num', 'samples'])
Speed test:
%timeit data = pd.DataFrame([(row.subject, row.trial_num, sample) for row in df.itertuples() for sample in row.samples], columns=['subject', 'trial_num', 'samples'])
1.33 ms ± 74.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit data = df.set_index(['subject', 'trial_num'])['samples'].apply(pd.Series).stack().reset_index()
4.9 ms ± 189 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit data = pd.DataFrame({col:np.repeat(df[col].values, df['samples'].str.len())for col in df.columns.drop('samples')}).assign(**{'samples':np.concatenate(df['samples'].values)})
1.38 ms ± 25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

DataFrame Create new column after applying a function on groupby values - python

Related

Map numeric data into bins in Pandas dataframe for seperate groups using dictionaries

Generate new column based on values in another column and their index

Add a column of normalised values based on sections of a dataframe column

Concat two DataFrames on missing indices

Pandas column of lists, create a row for each list element

Categories

Resources