Pandas group by timestamp and id and count - python

I have a dataframe in the following format:
import pandas as pd
d1 = {'ID': ['A','A','A','B','B','B','B','B','C'],
      'Time': ['1/18/2016','2/17/2016','2/16/2016','1/15/2016','2/14/2016',
               '2/13/2016','1/12/2016','2/9/2016','1/11/2016'],
      'Product_ID': ['2','1','1','1','1','2','1','2','2'],
      'Var_1': [0.11,0.22,0.09,0.07,0.4,0.51,0.36,0.54,0.19],
      'Var_2': [1,0,1,0,1,0,1,0,1],
      'Var_3': ['1','1','1','1','0','1','1','0','0']}
df1 = pd.DataFrame(d1)
Where df1 is of the form:
ID Time Product_ID Var_1 Var_2 Var_3
A 1/18/2016 2 0.11 1 1
A 2/17/2016 1 0.22 0 1
A 2/16/2016 1 0.09 1 1
B 1/15/2016 1 0.07 0 1
B 2/14/2016 1 0.4 1 0
B 2/13/2016 2 0.51 0 1
B 1/12/2016 1 0.36 1 1
B 2/9/2016 2 0.54 0 0
C 1/11/2016 2 0.19 1 0
where Time is in MM/DD/YYYY format.
This is what I need to do:
1) Group ID and Product_ID by Time (specifically by each month).
2) Then carry out the following column operations:
a) find the sum of the columns Var_2 and Var_3, and
b) find the mean of the column Var_1.
3) Then create a column with the count of rows for each ID and Product_ID in each month.
4) Finally, also include the ID and Product_ID combinations for which there are no entries.
For example, for ID = A and Product_ID = 1 in Time = 2016-1 (January 2016), there are no observations, so all variables take the value 0.
Again, for ID = A and Product_ID = 1 in Time = 2016-2 (February 2016), Var_1 = (0.22 + 0.09)/2 = 0.155, Var_2 = 0 + 1 = 1, Var_3 = 1 + 1 = 2, and finally Count = 2.
This is the output that I would like.
ID Product_ID Time Var_1 Var_2 Var_3 Count
A 1 2016-1 0 0 0 0
A 1 2016-2 0.155 1 2 2
B 1 2016-1 0.215 1 1 2
B 1 2016-2 0.4 1 0 1
C 1 2016-1 0 0 0 0
C 1 2016-2 0 0 0 0
A 2 2016-1 0.11 1 1 1
A 2 2016-2 0 0 0 0
B 2 2016-1 0 0 0 0
B 2 2016-2 0.455 1 2 2
C 2 2016-1 0.19 1 0 1
C 2 2016-2 0 0 0 0
This is a little more than my programming capabilities can handle (I know the groupby function exists, but I could not figure out how to incorporate the rest of the changes). Please let me know if you have questions.
Any help will be appreciated. Thanks.

I'll break down the steps:
df1.Time = pd.to_datetime(df1.Time)
df1.Time = df1.Time.dt.month + df1.Time.dt.year * 100  # e.g. 201601
df1['Var_3'] = df1['Var_3'].astype(int)
output = df1.groupby(['ID','Product_ID','Time']).agg({'Var_1':'mean','Var_2':'sum','Var_3':'sum'})
output = output.unstack(2).stack(dropna=False).fillna(0)  # re-add the missing Time periods as rows of 0
output['Count'] = output.max(axis=1)
output.reset_index().sort_values(['Product_ID','ID'])
Out[1032]:
ID Product_ID Time Var_3 Var_2 Var_1 Count
0 A 1 201601 0.0 0.0 0.000 0.0
1 A 1 201602 2.0 1.0 0.155 2.0
4 B 1 201601 2.0 1.0 0.215 2.0
5 B 1 201602 0.0 1.0 0.400 1.0
2 A 2 201601 1.0 1.0 0.110 1.0
3 A 2 201602 0.0 0.0 0.000 0.0
6 B 2 201601 0.0 0.0 0.000 0.0
7 B 2 201602 1.0 0.0 0.525 1.0
8 C 2 201601 0.0 1.0 0.190 1.0
9 C 2 201602 0.0 0.0 0.000 0.0
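For reference, a sketch of an alternative that computes a real row count and reindexes over every ID/Product_ID/Time combination, so combinations with no observations show up as zeros (as in the desired output). It assumes df1 is the original frame from the question, before the conversions above; the names agg, full_index and result are just illustrative, and named aggregation needs pandas >= 0.25:
import pandas as pd

df1['Time'] = pd.to_datetime(df1['Time']).dt.to_period('M')   # monthly periods, e.g. 2016-01
df1['Var_3'] = df1['Var_3'].astype(int)

agg = (df1.groupby(['ID', 'Product_ID', 'Time'])
          .agg(Var_1=('Var_1', 'mean'),
               Var_2=('Var_2', 'sum'),
               Var_3=('Var_3', 'sum'),
               Count=('Var_1', 'size')))

# Reindex over the full cartesian product of ID x Product_ID x Time so that
# combinations with no observations become rows of zeros.
full_index = pd.MultiIndex.from_product(
    [df1['ID'].unique(), df1['Product_ID'].unique(), df1['Time'].unique()],
    names=['ID', 'Product_ID', 'Time'])
result = agg.reindex(full_index, fill_value=0).reset_index()
print(result.sort_values(['Product_ID', 'ID']))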

Related

How to rank the categorical values while one-hot-encoding

I have the data like this:
id  feature_1  feature_2
1   a          e
2   b          c
3   c          d
4   d          b
5   e          a
I want a one-hot-encoded-like feature where the value coming from feature_1 is represented as 1 and the value coming from feature_2 as 0.5, like the following table:
id  a    b    c    d    e
1   1    0    0    0    0.5
2   0    1    0.5  0    0
3   0    0    1    0.5  0
4   0    0.5  0    1    0
5   0.5  0    0    0    1
But when applying sklearn.preprocessing.OneHotEncoder
it outputs 10 columns with respective 1s.
How can I achieve this?
For the two columns, you can do:
pd.crosstab(df.id, df.feature_1) + pd.crosstab(df['id'], df['feature_2']) * .5
Output:
feature_1 a b c d e
id
1 1.0 0.0 0.0 0.0 0.5
2 0.0 1.0 0.5 0.0 0.0
3 0.0 0.0 1.0 0.5 0.0
4 0.0 0.5 0.0 1.0 0.0
5 0.5 0.0 0.0 0.0 1.0
If you have more than two features, with the weights defined, you can melt and then map the features to the weights:
weights = {'feature_1': 1, 'feature_2': 0.5}
flatten = df.melt('id')
(flatten['variable'].map(weights)
    .groupby([flatten['id'], flatten['value']])
    .sum()
    .unstack('value', fill_value=0)
)
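For completeness, a minimal runnable sketch of both approaches; the DataFrame below just mirrors the table from the question:
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'feature_1': ['a', 'b', 'c', 'd', 'e'],
    'feature_2': ['e', 'c', 'd', 'b', 'a'],
})

# Two-feature case: weight feature_1 as 1 and feature_2 as 0.5.
two_cols = pd.crosstab(df['id'], df['feature_1']) + pd.crosstab(df['id'], df['feature_2']) * 0.5

# General case: melt, map each source column to its weight, then pivot back.
weights = {'feature_1': 1, 'feature_2': 0.5}
flatten = df.melt('id')
general = (flatten['variable'].map(weights)
           .groupby([flatten['id'], flatten['value']])
           .sum()
           .unstack('value', fill_value=0))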

Matrix are not aligned for dot product

Here is my df matrix:
0 Rooms Area Price
0 0 0.4 0.32 0.307692
1 0 0.4 0.40 0.461538
2 0 0.6 0.48 0.615385
3 0 0.6 0.56 0.646154
4 0 0.6 0.60 0.692308
5 0 0.8 0.72 0.769231
6 0 0.8 0.80 0.846154
7 0 1.0 1.00 1.000000
Here is my B matrix:
weights
0 88
1 87
2 44
3 46
When I write df.dot(B) it says the matrices are not aligned.
But df is an 8*4 matrix and B is 4*1, so shouldn't the dot product be an 8*1 matrix?
Error: ValueError: matrices are not aligned
You can just use mul:
out = df.mul(B['weights'].values, axis=1)
Out[207]:
0 Rooms Area Price
0 0 34.8 14.08 14.153832
1 0 34.8 17.60 21.230748
2 0 52.2 21.12 28.307710
3 0 52.2 24.64 29.723084
4 0 52.2 26.40 31.846168
5 0 69.6 31.68 35.384626
6 0 69.6 35.20 38.923084
7 0 87.0 44.00 46.000000
"the column names of DataFrame and the index of other must contain the same values,"
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dot.html
Try renaming the index of B to match the column names of A.
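A small sketch of that fix, assuming df and B are the frames shown above; the numpy line is just an equivalent label-free alternative:
# Give B's index the same labels as df's columns so pandas can align them.
B.index = df.columns
out = df.dot(B)            # 8x1 result: one weighted sum per row

# Label-free alternative on the underlying arrays.
out_np = df.values @ B.values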

Is there a better way to handle NaN values?

I have an input dataframe
KPI_ID KPI_Key1 KPI_Key2 KPI_Key3
A (C602+C603) C601 75
B (C605+C606) C602 NaN
C 75 L239+C602 NaN
D (32*(C603+44)) 75 NaN
E L239 NaN C601
I have an Indicator df
99 75 C604 C602 C601 C603 C605 C606 44 L239 32
PatientID
1 1 0 1 0 1 0 0 0 1 0 1
2 0 0 0 0 0 0 1 1 0 0 0
3 1 1 1 1 0 1 1 1 1 1 1
4 0 0 0 0 0 1 0 1 0 1 0
5 1 0 1 1 1 1 0 1 1 1 1
source:
import numpy as np
import pandas as pd

input_df = pd.DataFrame({'KPI_ID': ['A','B','C','D','E'],
                         'KPI_Key1': ['(C602+C603)','(C605+C606)','75','(32*(C603+44))','L239'],
                         'KPI_Key2': ['C601','C602','L239+C602','75',np.nan],
                         'KPI_Key3': ['75',np.nan,np.nan,np.nan,'C601']})
indicator_df = pd.DataFrame({'PatientID': [1,2,3,4,5],
                             '99': ['1','0','1','0','1'],
                             '75': ['0','0','1','0','0'],
                             'C604': ['1','0','1','0','1'],
                             'C602': ['0','0','1','0','1'],
                             'C601': ['1','0','0','0','1'],
                             'C603': ['0','0','1','1','1'],
                             'C605': ['0','1','1','0','0'],
                             'C606': ['0','1','1','1','1'],
                             '44': ['1','0','1','0','1'],
                             'L239': ['0','0','1','1','1'],
                             '32': ['1','0','1','0','1']}).set_index('PatientID')
My goal is to create an output df like this (by evaluating input_df against indicator_df):
final_out_df:
PatientID KPI_ID KPI_Key1 KPI_Key2 KPI_Key3
1 A 0 1 0
2 A 0 0 0
3 A 2 0 1
4 A 1 0 0
5 A 2 1 0
1 B 0 0 0
2 B 2 0 0
3 B 2 1 0
... ... ... ... ...
I am VERY close and my logic works fine, except that I am unable to handle the NaN values in the input_df. I am able to generate the output for KPI_ID 'A', since none of the three formulas (KPI_Key1, KPI_Key2, KPI_Key3 for 'A') are null, but I fail to generate it for 'B'. Is there anything I can do instead of using a dummy variable in place of NaN and creating that row in indicator_df?
Here is what I did so far:
import re

indicator_df = indicator_df.astype('int32')
final_out_df = pd.DataFrame()
out_df = pd.DataFrame(index=indicator_df.index)
out_df.reset_index(level=0, inplace=True)
# running the loop only for 'A' so it won't fail
for i in range(0, len(input_df) - 4):
    for j in ['KPI_Key1', 'KPI_Key2', 'KPI_Key3']:
        exp = input_df[j].iloc[i]
        temp_out_df = indicator_df.eval(re.sub(r'(\w+)', r'`\1`', exp)).reset_index(name=j)
        out_df['KPI_ID'] = input_df['KPI_ID'].iloc[i]
        out_df = out_df.merge(temp_out_df, on='PatientID', how='left')
    final_out_df = final_out_df.append(out_df)
    out_df = pd.DataFrame(index=indicator_df.index)
    out_df.reset_index(level=0, inplace=True)
Replace NaN by None and create a dict of local variables to allow a correct evaluation with pd.eval:
def eval_kpi(row):
    kpi = row.filter(like='KPI_Key').fillna('None')
    return pd.Series(pd.eval(kpi, local_dict=row['local_vars']), index=kpi.index)

final_out_df = indicator_df.astype(int).apply(dict, axis=1) \
                           .rename('local_vars').reset_index() \
                           .merge(input_df, how='cross')
final_out_df.update(final_out_df.apply(eval_kpi, axis=1))
final_out_df = final_out_df.drop(columns='local_vars') \
                           .sort_values(['KPI_ID', 'PatientID']) \
                           .reset_index(drop=True)
Output:
>>> final_out_df
PatientID KPI_ID KPI_Key1 KPI_Key2 KPI_Key3
0 1 A 0.0 1.0 75.0
1 2 A 0.0 0.0 75.0
2 3 A 2.0 0.0 75.0
3 4 A 1.0 0.0 75.0
4 5 A 2.0 1.0 75.0
5 1 B 0.0 0.0 NaN
6 2 B 2.0 0.0 NaN
7 3 B 2.0 1.0 NaN
8 4 B 1.0 0.0 NaN
9 5 B 1.0 1.0 NaN
10 1 C 75.0 0.0 NaN
11 2 C 75.0 0.0 NaN
12 3 C 75.0 2.0 NaN
13 4 C 75.0 1.0 NaN
14 5 C 75.0 2.0 NaN
15 1 D 1408.0 75.0 NaN
16 2 D 1408.0 75.0 NaN
17 3 D 1440.0 75.0 NaN
18 4 D 1440.0 75.0 NaN
19 5 D 1440.0 75.0 NaN
20 1 E 0.0 NaN 1.0
21 2 E 0.0 NaN 0.0
22 3 E 1.0 NaN 0.0
23 4 E 1.0 NaN 0.0
24 5 E 1.0 NaN 1.0
I was able to solve it by adding
if exp == exp:
before parsing exp through the regex (NaN is the only value that is not equal to itself, so this check skips the missing formulas).
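For reference, a sketch of how that guard slots into the loop from the question (same variable names and setup as above; .append() is replaced with pd.concat, which newer pandas requires):
final_out_df = pd.DataFrame()
out_df = pd.DataFrame(index=indicator_df.index).reset_index()

for i in range(len(input_df)):
    for j in ['KPI_Key1', 'KPI_Key2', 'KPI_Key3']:
        exp = input_df[j].iloc[i]
        if exp == exp:  # False only when exp is NaN, so missing formulas are skipped
            temp_out_df = indicator_df.eval(re.sub(r'(\w+)', r'`\1`', exp)).reset_index(name=j)
            out_df = out_df.merge(temp_out_df, on='PatientID', how='left')
        out_df['KPI_ID'] = input_df['KPI_ID'].iloc[i]
    final_out_df = pd.concat([final_out_df, out_df], ignore_index=True)
    out_df = pd.DataFrame(index=indicator_df.index).reset_index()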

Summarize data from a list of pandas dataframes

I have a list of dfs, df_list:
[ CLASS IDX A B C D
0 1 1 1.0 0.0 0.0 0.0
1 1 2 1.0 0.0 0.0 0.0
2 1 3 1.0 0.0 0.0 0.0,
CLASS IDX A B C D
0 1 1 NaN NaN NaN NaN
1 1 2 1.0 0.0 0.0 0.0
2 1 3 1.0 0.0 0.0 0.0,
CLASS IDX A B C D
0 1 1 0.900 0.100 0.0 0.0
1 1 2 1.000 0.000 0.0 0.0
2 1 3 0.999 0.001 0.0 0.0]
I would like to summarize the data into one df based on conditions and values in the individual dfs. Each df has 4 columns of interest: A, B, C and D. If, for example, the value in column A is >= 0.1 in df_list[0], I want to print 'A' in the summary df. If two columns, for example A and B, have values >= 0.1, I want to print 'A/B'. The final summary df for this data should be:
CLASS IDX 0 1 2
0 1 1 A NaN A/B
1 1 2 A A A
2 1 3 A A A
In the summary df, the column labels (0,1,2) represent the position of the df in the df_list.
I am starting with this
for index, values in enumerate(df_list):
# summarize the data
But not sure what would be the best way to continue..
Any help greatly appreciated!
Here is one approach:
cols = ['A', 'B', 'C', 'D']

def join_func(df):
    m = df[cols].ge(0.1)
    return (df[cols].mask(m, cols)
                    .where(m, np.nan)
                    .apply(lambda x: '/'.join(x.dropna()), axis=1))

result = (df_list[0].loc[:, ['CLASS', 'IDX']]
          .assign(**{str(i): join_func(df)
                     for i, df in enumerate(df_list)}))
print(result)
CLASS IDX 0 1 2
0 1 1 A A/B
1 1 2 A A A
2 1 3 A A A
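If you want to reproduce this, a sketch of constructing df_list with the values from the question (the name base is just for illustration):
import numpy as np
import pandas as pd

base = pd.DataFrame({'CLASS': [1, 1, 1], 'IDX': [1, 2, 3]})
df_list = [
    base.assign(A=1.0, B=0.0, C=0.0, D=0.0),
    base.assign(A=[np.nan, 1.0, 1.0], B=[np.nan, 0.0, 0.0],
                C=[np.nan, 0.0, 0.0], D=[np.nan, 0.0, 0.0]),
    base.assign(A=[0.900, 1.000, 0.999], B=[0.100, 0.000, 0.001], C=0.0, D=0.0),
]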

Pandas column of lists, create a row for each list element

I have a dataframe where some cells contain lists of multiple values. Rather than storing multiple
values in a cell, I'd like to expand the dataframe so that each item in the list gets its own row (with the same values in all other columns). So if I have:
import pandas as pd
import numpy as np
df = pd.DataFrame(
{'trial_num': [1, 2, 3, 1, 2, 3],
'subject': [1, 1, 1, 2, 2, 2],
'samples': [list(np.random.randn(3).round(2)) for i in range(6)]
}
)
df
Out[10]:
samples subject trial_num
0 [0.57, -0.83, 1.44] 1 1
1 [-0.01, 1.13, 0.36] 1 2
2 [1.18, -1.46, -0.94] 1 3
3 [-0.08, -4.22, -2.05] 2 1
4 [0.72, 0.79, 0.53] 2 2
5 [0.4, -0.32, -0.13] 2 3
How do I convert to long form, e.g.:
subject trial_num sample sample_num
0 1 1 0.57 0
1 1 1 -0.83 1
2 1 1 1.44 2
3 1 2 -0.01 0
4 1 2 1.13 1
5 1 2 0.36 2
6 1 3 1.18 0
# etc.
The index is not important, it's OK to set existing
columns as the index and the final ordering isn't
important.
Pandas >= 0.25
Series and DataFrame define an .explode() method that explodes lists into separate rows. See the docs section on Exploding a list-like column.
df = pd.DataFrame({
'var1': [['a', 'b', 'c'], ['d', 'e',], [], np.nan],
'var2': [1, 2, 3, 4]
})
df
var1 var2
0 [a, b, c] 1
1 [d, e] 2
2 [] 3
3 NaN 4
df.explode('var1')
var1 var2
0 a 1
0 b 1
0 c 1
1 d 2
1 e 2
2 NaN 3 # empty list converted to NaN
3 NaN 4 # NaN entry preserved as-is
# to reset the index to be monotonically increasing...
df.explode('var1').reset_index(drop=True)
var1 var2
0 a 1
1 b 1
2 c 1
3 d 2
4 e 2
5 NaN 3
6 NaN 4
Note that this also handles mixed columns of lists and scalars, as well as empty lists and NaNs appropriately (this is a drawback of repeat-based solutions).
However, you should note that explode only works on a single column (for now).
P.S.: if you are looking to explode a column of strings, you need to split on a separator first, then use explode. See this (very much) related answer by me.
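A minimal sketch of that string case; the column name tags and the comma separator are just for illustration:
import pandas as pd

df_str = pd.DataFrame({'id': [1, 2], 'tags': ['a,b,c', 'd,e']})

# Split the strings into lists first, then explode the lists into rows.
out = df_str.assign(tags=df_str['tags'].str.split(',')).explode('tags')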
A bit longer than I expected:
>>> df
samples subject trial_num
0 [-0.07, -2.9, -2.44] 1 1
1 [-1.52, -0.35, 0.1] 1 2
2 [-0.17, 0.57, -0.65] 1 3
3 [-0.82, -1.06, 0.47] 2 1
4 [0.79, 1.35, -0.09] 2 2
5 [1.17, 1.14, -1.79] 2 3
>>>
>>> s = df.apply(lambda x: pd.Series(x['samples']),axis=1).stack().reset_index(level=1, drop=True)
>>> s.name = 'sample'
>>>
>>> df.drop('samples', axis=1).join(s)
subject trial_num sample
0 1 1 -0.07
0 1 1 -2.90
0 1 1 -2.44
1 1 2 -1.52
1 1 2 -0.35
1 1 2 0.10
2 1 3 -0.17
2 1 3 0.57
2 1 3 -0.65
3 2 1 -0.82
3 2 1 -1.06
3 2 1 0.47
4 2 2 0.79
4 2 2 1.35
4 2 2 -0.09
5 2 3 1.17
5 2 3 1.14
5 2 3 -1.79
If you want a sequential index, you can apply reset_index(drop=True) to the result.
update:
>>> res = df.set_index(['subject', 'trial_num'])['samples'].apply(pd.Series).stack()
>>> res = res.reset_index()
>>> res.columns = ['subject','trial_num','sample_num','sample']
>>> res
subject trial_num sample_num sample
0 1 1 0 1.89
1 1 1 1 -2.92
2 1 1 2 0.34
3 1 2 0 0.85
4 1 2 1 0.24
5 1 2 2 0.72
6 1 3 0 -0.96
7 1 3 1 -2.72
8 1 3 2 -0.11
9 2 1 0 -1.33
10 2 1 1 3.13
11 2 1 2 -0.65
12 2 2 0 0.10
13 2 2 1 0.65
14 2 2 2 0.15
15 2 3 0 0.64
16 2 3 1 -0.10
17 2 3 2 -0.76
UPDATE: the solution below was helpful for older Pandas versions, because the DataFrame.explode() wasn’t available. Starting from Pandas 0.25.0 you can simply use DataFrame.explode().
lst_col = 'samples'

r = pd.DataFrame({
        col: np.repeat(df[col].values, df[lst_col].str.len())
        for col in df.columns.drop(lst_col)}
    ).assign(**{lst_col: np.concatenate(df[lst_col].values)})[df.columns]
Result:
In [103]: r
Out[103]:
samples subject trial_num
0 0.10 1 1
1 -0.20 1 1
2 0.05 1 1
3 0.25 1 2
4 1.32 1 2
5 -0.17 1 2
6 0.64 1 3
7 -0.22 1 3
8 -0.71 1 3
9 -0.03 2 1
10 -0.65 2 1
11 0.76 2 1
12 1.77 2 2
13 0.89 2 2
14 0.65 2 2
15 -0.98 2 3
16 0.65 2 3
17 -0.30 2 3
PS here you may find a bit more generic solution
UPDATE: some explanations: IMO the easiest way to understand this code is to try to execute it step-by-step:
in the following line we repeat the values in one column N times, where N is the length of the corresponding list:
In [10]: np.repeat(df['trial_num'].values, df[lst_col].str.len())
Out[10]: array([1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1, 1, 2, 2, 2, 3, 3, 3], dtype=int64)
this can be generalized to all columns containing scalar values:
In [11]: pd.DataFrame({
...: col:np.repeat(df[col].values, df[lst_col].str.len())
...: for col in df.columns.drop(lst_col)}
...: )
Out[11]:
trial_num subject
0 1 1
1 1 1
2 1 1
3 2 1
4 2 1
5 2 1
6 3 1
.. ... ...
11 1 2
12 2 2
13 2 2
14 2 2
15 3 2
16 3 2
17 3 2
[18 rows x 2 columns]
using np.concatenate() we can flatten all values in the list column (samples) and get a 1D vector:
In [12]: np.concatenate(df[lst_col].values)
Out[12]: array([-1.04, -0.58, -1.32, 0.82, -0.59, -0.34, 0.25, 2.09, 0.12, 0.83, -0.88, 0.68, 0.55, -0.56, 0.65, -0.04, 0.36, -0.31])
putting all this together:
In [13]: pd.DataFrame({
...: col:np.repeat(df[col].values, df[lst_col].str.len())
...: for col in df.columns.drop(lst_col)}
...: ).assign(**{lst_col:np.concatenate(df[lst_col].values)})
Out[13]:
trial_num subject samples
0 1 1 -1.04
1 1 1 -0.58
2 1 1 -1.32
3 2 1 0.82
4 2 1 -0.59
5 2 1 -0.34
6 3 1 0.25
.. ... ... ...
11 1 2 0.68
12 2 2 0.55
13 2 2 -0.56
14 2 2 0.65
15 3 2 -0.04
16 3 2 0.36
17 3 2 -0.31
[18 rows x 3 columns]
finally, indexing the resulting DataFrame with [df.columns] guarantees that we select the columns in the original order...
you can also use pd.concat and pd.melt for this:
>>> objs = [df, pd.DataFrame(df['samples'].tolist())]
>>> pd.concat(objs, axis=1).drop('samples', axis=1)
subject trial_num 0 1 2
0 1 1 -0.49 -1.00 0.44
1 1 2 -0.28 1.48 2.01
2 1 3 -0.52 -1.84 0.02
3 2 1 1.23 -1.36 -1.06
4 2 2 0.54 0.18 0.51
5 2 3 -2.18 -0.13 -1.35
>>> pd.melt(_, var_name='sample_num', value_name='sample',
... value_vars=[0, 1, 2], id_vars=['subject', 'trial_num'])
subject trial_num sample_num sample
0 1 1 0 -0.49
1 1 2 0 -0.28
2 1 3 0 -0.52
3 2 1 0 1.23
4 2 2 0 0.54
5 2 3 0 -2.18
6 1 1 1 -1.00
7 1 2 1 1.48
8 1 3 1 -1.84
9 2 1 1 -1.36
10 2 2 1 0.18
11 2 3 1 -0.13
12 1 1 2 0.44
13 1 2 2 2.01
14 1 3 2 0.02
15 2 1 2 -1.06
16 2 2 2 0.51
17 2 3 2 -1.35
Last, if you need to, you can sort based on the first three columns.
Trying to work through Roman Pekar's solution step-by-step to understand it better, I came up with my own solution, which uses melt to avoid some of the confusing stacking and index resetting. I can't say that it's obviously a clearer solution though:
items_as_cols = df.apply(lambda x: pd.Series(x['samples']), axis=1)
# Keep original df index as a column so it's retained after melt
items_as_cols['orig_index'] = items_as_cols.index
melted_items = pd.melt(items_as_cols, id_vars='orig_index',
var_name='sample_num', value_name='sample')
melted_items.set_index('orig_index', inplace=True)
df.merge(melted_items, left_index=True, right_index=True)
Output (obviously we can drop the original samples column now):
samples subject trial_num sample_num sample
0 [1.84, 1.05, -0.66] 1 1 0 1.84
0 [1.84, 1.05, -0.66] 1 1 1 1.05
0 [1.84, 1.05, -0.66] 1 1 2 -0.66
1 [-0.24, -0.9, 0.65] 1 2 0 -0.24
1 [-0.24, -0.9, 0.65] 1 2 1 -0.90
1 [-0.24, -0.9, 0.65] 1 2 2 0.65
2 [1.15, -0.87, -1.1] 1 3 0 1.15
2 [1.15, -0.87, -1.1] 1 3 1 -0.87
2 [1.15, -0.87, -1.1] 1 3 2 -1.10
3 [-0.8, -0.62, -0.68] 2 1 0 -0.80
3 [-0.8, -0.62, -0.68] 2 1 1 -0.62
3 [-0.8, -0.62, -0.68] 2 1 2 -0.68
4 [0.91, -0.47, 1.43] 2 2 0 0.91
4 [0.91, -0.47, 1.43] 2 2 1 -0.47
4 [0.91, -0.47, 1.43] 2 2 2 1.43
5 [-1.14, -0.24, -0.91] 2 3 0 -1.14
5 [-1.14, -0.24, -0.91] 2 3 1 -0.24
5 [-1.14, -0.24, -0.91] 2 3 2 -0.91
For those looking for a version of Roman Pekar's answer that avoids manual column naming:
column_to_explode = 'samples'
res = (df
       .set_index([x for x in df.columns if x != column_to_explode])[column_to_explode]
       .apply(pd.Series)
       .stack()
       .reset_index())
res = res.rename(columns={
    res.columns[-2]: 'exploded_{}_index'.format(column_to_explode),
    res.columns[-1]: '{}_exploded'.format(column_to_explode)})
I found the easiest way was to:
Convert the samples column into a DataFrame
Join it with the original df
Melt
Shown here:
(df.samples.apply(lambda x: pd.Series(x))
   .join(df)
   .melt(['subject', 'trial_num'], [0, 1, 2], var_name='sample'))
subject trial_num sample value
0 1 1 0 -0.24
1 1 2 0 0.14
2 1 3 0 -0.67
3 2 1 0 -1.52
4 2 2 0 -0.00
5 2 3 0 -1.73
6 1 1 1 -0.70
7 1 2 1 -0.70
8 1 3 1 -0.29
9 2 1 1 -0.70
10 2 2 1 -0.72
11 2 3 1 1.30
12 1 1 2 -0.55
13 1 2 2 0.10
14 1 3 2 -0.44
15 2 1 2 0.13
16 2 2 2 -1.44
17 2 3 2 0.73
It's worth noting that this may have only worked because each trial has the same number of samples (3). Something more clever may be necessary for trials of different sample sizes.
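If the lists do have different lengths, one possible sketch (using explode, available in pandas >= 0.25, plus groupby().cumcount() to rebuild sample_num) would be:
out = (df.explode('samples')
         .rename(columns={'samples': 'sample'})
         .reset_index(drop=True))
# Number the samples within each (subject, trial_num) group; this works for ragged lists too.
out['sample_num'] = out.groupby(['subject', 'trial_num']).cumcount()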
import pandas as pd

df = pd.DataFrame([{'Product': 'Coke', 'Prices': [100, 123, 101, 105, 99, 94, 98]},
                   {'Product': 'Pepsi', 'Prices': [101, 104, 104, 101, 99, 99, 99]}])
print(df)

# Prices already holds lists, so explode it directly; str.split(',') would only
# be needed if Prices were comma-separated strings.
df = df.explode('Prices')
print(df)
Try this in pandas >= 0.25.
Very late answer but I want to add this:
A fast solution using vanilla Python that also takes care of the sample_num column in OP's example. On my own large dataset with over 10 million rows and a result with 28 million rows this only takes about 38 seconds. The accepted solution completely breaks down with that amount of data and leads to a memory error on my system that has 128GB of RAM.
df = df.reset_index(drop=True)
lstcol = df.lstcol.values
lstcollist = []
indexlist = []
countlist = []
for ii in range(len(lstcol)):
    lstcollist.extend(lstcol[ii])
    indexlist.extend([ii] * len(lstcol[ii]))
    countlist.extend([jj for jj in range(len(lstcol[ii]))])
df = pd.merge(df.drop("lstcol", axis=1),
              pd.DataFrame({"lstcol": lstcollist, "lstcol_num": countlist},
                           index=indexlist),
              left_index=True, right_index=True).reset_index(drop=True)
Also very late, but here is an answer from Karvy1 that worked well for me if you don't have pandas >=0.25 version: https://stackoverflow.com/a/52511166/10740287
For the example above you may write:
data = [(row.subject, row.trial_num, sample) for row in df.itertuples() for sample in row.samples]
data = pd.DataFrame(data, columns=['subject', 'trial_num', 'samples'])
Speed test:
%timeit data = pd.DataFrame([(row.subject, row.trial_num, sample) for row in df.itertuples() for sample in row.samples], columns=['subject', 'trial_num', 'samples'])
1.33 ms ± 74.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit data = df.set_index(['subject', 'trial_num'])['samples'].apply(pd.Series).stack().reset_index()
4.9 ms ± 189 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit data = pd.DataFrame({col:np.repeat(df[col].values, df['samples'].str.len())for col in df.columns.drop('samples')}).assign(**{'samples':np.concatenate(df['samples'].values)})
1.38 ms ± 25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
