Group column values into pairs - pandas - python

I have a df:
idx pairs
1 ['000001.jpg', '000002.jpg']
2 ['000006.jpg', '000007.jpg', '000008.jpg', '000004.jpg', '000005.jpg', '000003.jpg']
3 ['000016.jpg', '000020.jpg', '000017.jpg', '000010.jpg', '000011.jpg', '000012.jpg'...]
The lists in pairs can have any length. I'd like to create a new df where each row of 'pairs' contains exactly two items, the first of which is always the first element of the original list. E.g.:
idx pairs
1 ['000001.jpg', '000002.jpg']
2 ['000006.jpg', '000007.jpg']
3 ['000006.jpg', '000008.jpg']
4 ['000006.jpg', '000004.jpg']
5 ['000006.jpg', '000005.jpg']
6 ['000006.jpg', '000003.jpg']
7 ['000016.jpg', '000020.jpg']
8 ['000016.jpg', '000017.jpg']
9 ['000016.jpg', '000010.jpg']
10 ['000016.jpg', '000011.jpg']
11 ['000016.jpg', '000012.jpg']

Seems like a great case for explode.
df['first'] = df.pairs.apply(lambda x: x[0])
df['others'] = df.pairs.apply(lambda x: x[1:])
df = df.explode('others')[['first', 'others']]
df = pd.DataFrame({'pairs': df.values.tolist()})
df = df.rename_axis('idx').reset_index()
df.idx += 1
Then the head of df will look like this:
idx pairs
0 1 [000001.jpg, 000002.jpg]
1 2 [000006.jpg, 000007.jpg]
2 3 [000006.jpg, 000008.jpg]
3 4 [000006.jpg, 000004.jpg]
4 5 [000006.jpg, 000005.jpg]
5 6 [000006.jpg, 000003.jpg]
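A minimal end-to-end sketch of the approach above, assuming sample data truncated to the first two rows of the question:

```python
import pandas as pd

# truncated sample data from the question
df = pd.DataFrame({"pairs": [
    ["000001.jpg", "000002.jpg"],
    ["000006.jpg", "000007.jpg", "000008.jpg"],
]})

df["first"] = df.pairs.apply(lambda x: x[0])   # head of each list
df["others"] = df.pairs.apply(lambda x: x[1:]) # tail of each list
out = df.explode("others")[["first", "others"]]
out = pd.DataFrame({"pairs": out.values.tolist()})
out = out.rename_axis("idx").reset_index()
out.idx += 1
print(out.pairs.tolist())
# [['000001.jpg', '000002.jpg'], ['000006.jpg', '000007.jpg'], ['000006.jpg', '000008.jpg']]
```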

Here's one using a nested list comprehension and reconstructing the dataframe:
from itertools import chain
l = [[[i,j_] for j_ in j] for i, *j in df.pairs]
print(pd.DataFrame(chain.from_iterable(l)))
0 1
0 000001.jpg 000002.jpg
1 000006.jpg 000007.jpg
2 000006.jpg 000008.jpg
3 000006.jpg 000004.jpg
4 000006.jpg 000005.jpg
5 000006.jpg 000003.jpg
6 000016.jpg 000020.jpg
7 000016.jpg 000017.jpg
8 000016.jpg 000010.jpg
9 000016.jpg 000011.jpg
10 000016.jpg 000012.jpg
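The same comprehension also works on plain lists, which makes it easy to check in isolation (again with truncated sample data):

```python
from itertools import chain

# truncated sample data from the question
pairs = [["000001.jpg", "000002.jpg"],
         ["000006.jpg", "000007.jpg", "000008.jpg"]]

# pair the head of each list with every element of its tail, then flatten
l = [[[first, other] for other in tail] for first, *tail in pairs]
flat = list(chain.from_iterable(l))
print(flat)
# [['000001.jpg', '000002.jpg'], ['000006.jpg', '000007.jpg'], ['000006.jpg', '000008.jpg']]
```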

You can use the str accessor and explode:
df['first']= df.pairs.str[0]
df['pairs']= df.pairs.str[1:]
pairs first
0 [000002.jpg] 000001.jpg
1 [000007.jpg, 000008.jpg, 000004.jpg, 000005.jp... 000006.jpg
2 [000020.jpg, 000017.jpg, 000010.jpg, 000011.jp... 000016.jpg
df= df.explode("pairs")
pairs first
0 000002.jpg 000001.jpg
1 000007.jpg 000006.jpg
1 000008.jpg 000006.jpg
1 000004.jpg 000006.jpg
1 000005.jpg 000006.jpg
1 000003.jpg 000006.jpg
2 000020.jpg 000016.jpg
2 000017.jpg 000016.jpg
2 000010.jpg 000016.jpg
2 000011.jpg 000016.jpg
2 000012.jpg 000016.jpg
Swap the cols:
df= df.reindex(columns=["first","pairs"])
If you need one list in each row:
df= df.reset_index(drop=True).agg(list,axis=1)
0 [000001.jpg, 000002.jpg]
1 [000006.jpg, 000007.jpg]
2 [000006.jpg, 000008.jpg]
3 [000006.jpg, 000004.jpg]
4 [000006.jpg, 000005.jpg]
5 [000006.jpg, 000003.jpg]
6 [000016.jpg, 000020.jpg]
7 [000016.jpg, 000017.jpg]
8 [000016.jpg, 000010.jpg]
9 [000016.jpg, 000011.jpg]
10 [000016.jpg, 000012.jpg]
dtype: object
Edit:
'reindex' is not necessary in this case:
df['tail']= df.pairs.str[1:]
df['pairs']= df.pairs.str[0]
df= df.explode("tail")
etc.
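Putting the str-accessor steps above together into one self-contained sketch (truncated sample data assumed):

```python
import pandas as pd

df = pd.DataFrame({"pairs": [["000001.jpg", "000002.jpg"],
                             ["000006.jpg", "000007.jpg", "000008.jpg"]]})
df["first"] = df.pairs.str[0]   # head of each list
df["pairs"] = df.pairs.str[1:]  # tail of each list
df = df.explode("pairs").reindex(columns=["first", "pairs"])
# one list per row, as in the answer
result = df.reset_index(drop=True).agg(list, axis=1)
print(result.tolist())
# [['000001.jpg', '000002.jpg'], ['000006.jpg', '000007.jpg'], ['000006.jpg', '000008.jpg']]
```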

Drop rows if value in column changes

Assume I have the following pandas data frame:
my_class value
0 1 1
1 1 2
2 1 3
3 2 4
4 2 5
5 2 6
6 2 7
7 2 8
8 2 9
9 3 10
10 3 11
11 3 12
I want to identify the indices of "my_class" where the class changes and remove n rows after and before this index. The output of this example (with n=2) should look like:
my_class value
0 1 1
5 2 6
6 2 7
11 3 12
My approach:
# where class changes happen
s = df['my_class'].ne(df['my_class'].shift(-1).fillna(df['my_class']))
# mask with `bfill` and `ffill`
df[~(s.where(s).bfill(limit=1).ffill(limit=2).eq(1))]
Output:
my_class value
0 1 1
5 2 6
6 2 7
11 3 12
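A runnable check of the mask approach above, with the question's data built inline:

```python
import pandas as pd

df = pd.DataFrame({"my_class": [1] * 3 + [2] * 6 + [3] * 3,
                   "value": range(1, 13)})
# True on the last row of each class (where the next value differs)
s = df["my_class"].ne(df["my_class"].shift(-1).fillna(df["my_class"]))
# extend the mask 1 row back and 2 rows forward, then invert
out = df[~(s.where(s).bfill(limit=1).ffill(limit=2).eq(1))]
print(out["value"].tolist())
# [1, 6, 7, 12]
```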
One possible solution is to:
Make use of the fact that the index contains consecutive integers.
Find the index values where the class changes.
For each such index i, generate the sequence of indices from i-2 to i+1 and concatenate them.
Retrieve the rows whose indices are not in this list.
The code to do it is:
ind = df[df['my_class'].diff().fillna(0, downcast='infer') == 1].index
df[~df.index.isin([item for sublist in [range(i - 2, i + 2) for i in ind] for item in sublist])]
my_class = np.array([1] * 3 + [2] * 6 + [3] * 3)
cols = np.c_[my_class, np.arange(len(my_class)) + 1]
df = pd.DataFrame(cols, columns=['my_class', 'value'])
df['diff'] = df['my_class'].diff().fillna(0)
idx2drop = []
for i in df[df['diff'] == 1].index:
    idx2drop += range(i - 2, i + 2)
print(df.drop(idx2drop)[['my_class', 'value']])
Output:
my_class value
0 1 1
5 2 6
6 2 7
11 3 12
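The drop-list idea can be condensed into a self-contained sketch with a configurable n (n=2 as in the question):

```python
import numpy as np
import pandas as pd

my_class = np.array([1] * 3 + [2] * 6 + [3] * 3)
df = pd.DataFrame({"my_class": my_class,
                   "value": np.arange(len(my_class)) + 1})
n = 2
# indices where the class changes (first row of each new class)
ind = df.index[df["my_class"].diff().fillna(0).eq(1)]
idx2drop = [j for i in ind for j in range(i - n, i + n)]
out = df.drop(idx2drop)[["my_class", "value"]]
print(out["value"].tolist())
# [1, 6, 7, 12]
```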

Compute average of the pandas df conditioned on a parameter

I have the following df:
import numpy as np
import pandas as pd
a = []
for i in range(5):
    tmp_df = pd.DataFrame(np.random.random((10,4)))
    tmp_df['lvl'] = i
    a.append(tmp_df)
df = pd.concat(a, axis=0)
df =
0 1 2 3 lvl
0 0.928623 0.868600 0.854186 0.129116 0
1 0.667870 0.901285 0.539412 0.883890 0
2 0.384494 0.697995 0.242959 0.725847 0
3 0.993400 0.695436 0.596957 0.142975 0
4 0.518237 0.550585 0.426362 0.766760 0
5 0.359842 0.417702 0.873988 0.217259 0
6 0.820216 0.823426 0.585223 0.553131 0
7 0.492683 0.401155 0.479228 0.506862 0
..............................................
3 0.505096 0.426465 0.356006 0.584958 3
4 0.145472 0.558932 0.636995 0.318406 3
5 0.957969 0.068841 0.612658 0.184291 3
6 0.059908 0.298270 0.334564 0.738438 3
7 0.662056 0.074136 0.244039 0.848246 3
8 0.997610 0.043430 0.774946 0.097294 3
9 0.795873 0.977817 0.780772 0.849418 3
0 0.577173 0.430014 0.133300 0.760223 4
1 0.916126 0.623035 0.240492 0.638203 4
2 0.165028 0.626054 0.225580 0.356118 4
3 0.104375 0.137684 0.084631 0.987290 4
4 0.934663 0.835608 0.764334 0.651370 4
5 0.743265 0.072671 0.911947 0.925644 4
6 0.212196 0.587033 0.230939 0.994131 4
7 0.945275 0.238572 0.696123 0.536136 4
8 0.989021 0.073608 0.720132 0.254656 4
9 0.513966 0.666534 0.270577 0.055597 4
I am learning pandas and wondering: what is the easiest way to compute the average along the lvl column?
What I mean is:
(df[df.lvl ==0 ] + df[df.lvl ==1 ] + df[df.lvl ==2 ] + df[df.lvl ==3 ] + df[df.lvl ==4 ]) / 5
The desired output should be a table of shape (10, 4), without the lvl column, where each element is the average of the 5 elements with lvl = [0, 1, 2, 3, 4]. I hope that helps.
I think you need:
np.random.seed(456)
a = []
for i in range(5):
    tmp_df = pd.DataFrame(np.random.random((10,4)))
    tmp_df['lvl'] = i
    a.append(tmp_df)
df = pd.concat(a, axis=0)
#print (df)
df1 = (df[df.lvl ==0 ] + df[df.lvl ==1 ] +
df[df.lvl ==2 ] + df[df.lvl ==3 ] +
df[df.lvl ==4 ]) / 5
print (df1)
0 1 2 3 lvl
0 0.411557 0.520560 0.578900 0.541576 2
1 0.253469 0.655714 0.532784 0.620744 2
2 0.468099 0.576198 0.400485 0.333533 2
3 0.620207 0.367649 0.531639 0.475587 2
4 0.699554 0.548005 0.683745 0.457997 2
5 0.322487 0.316137 0.489660 0.362146 2
6 0.430058 0.159712 0.631610 0.641141 2
7 0.399944 0.511944 0.346402 0.754591 2
8 0.400190 0.373925 0.340727 0.407988 2
9 0.502879 0.399614 0.321710 0.715812 2
df = df.set_index('lvl')
df2 = df.groupby(df.groupby('lvl').cumcount()).mean()
print (df2)
0 1 2 3
0 0.411557 0.520560 0.578900 0.541576
1 0.253469 0.655714 0.532784 0.620744
2 0.468099 0.576198 0.400485 0.333533
3 0.620207 0.367649 0.531639 0.475587
4 0.699554 0.548005 0.683745 0.457997
5 0.322487 0.316137 0.489660 0.362146
6 0.430058 0.159712 0.631610 0.641141
7 0.399944 0.511944 0.346402 0.754591
8 0.400190 0.373925 0.340727 0.407988
9 0.502879 0.399614 0.321710 0.715812
EDIT:
If each subset of the DataFrame has an index from 0 to len(subset):
df2 = df.mean(level=0)
print (df2)
0 1 2 3 lvl
0 0.411557 0.520560 0.578900 0.541576 2
1 0.253469 0.655714 0.532784 0.620744 2
2 0.468099 0.576198 0.400485 0.333533 2
3 0.620207 0.367649 0.531639 0.475587 2
4 0.699554 0.548005 0.683745 0.457997 2
5 0.322487 0.316137 0.489660 0.362146 2
6 0.430058 0.159712 0.631610 0.641141 2
7 0.399944 0.511944 0.346402 0.754591 2
8 0.400190 0.373925 0.340727 0.407988 2
9 0.502879 0.399614 0.321710 0.715812 2
The groupby function is exactly what you want. It will group based on a condition, in this case where 'lvl' is the same, and then apply the mean function to the values for each column in that group.
df.groupby('lvl').mean()
it seems like you want to group by the index and take average of all the columns except lvl
i.e.
df.groupby(df.index)[[0,1,2,3]].mean()
For a dataframe generated using
np.random.seed(456)
a = []
for i in range(5):
tmp_df = pd.DataFrame(np.random.random((10,4)))
tmp_df['lvl'] = i
a.append(tmp_df)
df = pd.concat(a, axis=0)
df.groupby(df.index)[[0,1,2,3]].mean()
outputs:
0 1 2 3
0 0.411557 0.520560 0.578900 0.541576
1 0.253469 0.655714 0.532784 0.620744
2 0.468099 0.576198 0.400485 0.333533
3 0.620207 0.367649 0.531639 0.475587
4 0.699554 0.548005 0.683745 0.457997
5 0.322487 0.316137 0.489660 0.362146
6 0.430058 0.159712 0.631610 0.641141
7 0.399944 0.511944 0.346402 0.754591
8 0.400190 0.373925 0.340727 0.407988
9 0.502879 0.399614 0.321710 0.715812
which is identical to the output from
df.groupby(df.groupby('lvl').cumcount()).mean()
without resorting to double groupby.
IMO this is cleaner to read and, for a large dataframe, will be much faster.
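As a side note, df.mean(level=0) relied on an index-level shortcut that was deprecated in later pandas versions in favor of an explicit groupby; a runnable sanity check that grouping by the index matches the manual per-lvl average, using the seeded data from the answers:

```python
import numpy as np
import pandas as pd

np.random.seed(456)
a = []
for i in range(5):
    tmp_df = pd.DataFrame(np.random.random((10, 4)))
    tmp_df["lvl"] = i
    a.append(tmp_df)
df = pd.concat(a, axis=0)

# average the five (10, 4) blocks elementwise
by_index = df.groupby(df.index)[[0, 1, 2, 3]].mean()
manual = sum(df.loc[df.lvl == i, [0, 1, 2, 3]] for i in range(5)) / 5
print(np.allclose(by_index, manual))
# True
```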

Converting list of 2D Panda's DataFrame to 3D DataFrame

I am trying to create a Pandas DataFrame that attaches label values to 2D DataFrames. This is what I have done so far:
I am reading csv files using pd.read_csv and appending them to a list; for the purpose of this question let's consider the following code:
import numpy as np
import pandas as pd
raw_sample = []
labels = [1,1,1,2,2,2]
samples = np.random.randn(6, 5, 4)
for contents in range(samples.shape[0]):
    raw_sample.append(pd.DataFrame(samples[contents]))
Then I wrapped raw_sample in df = pd.DataFrame(raw_sample). Then I added the labels to df by doing the following:
df = df.set_index([df.index, labels])
df.index = df.index.set_names('index', level=0)
df.index = df.index.set_names('labels', level=1)
I tried printing this and I got
0
index labels
0 1 0 1 2 3
0 0...
1 1 0 1 2 3
0 0...
2 1 0 1 2 3
0 1...
3 2 0 1 2 3
0 -0...
4 2 0 1 2 3
0 0...
5 2 0 1 2 3
0 -0...
I have also tried printing df[0]; I still got the same thing.
I wanted to know if it is in the form of
index labels 0
0 1 1 2 3 4 5 6 7
3 5 6 7 9 5 4
3 4 5 6 7 8 9
1 1 4 3 2 4 5 6 7
3 5 6 7 4 5 6
2 3 4 3 4 5 3
...
I know that a DataFrame cannot take a 2D array; the other option was to use pd.Panel. For this I converted all the contents of raw_sample to numpy arrays, converted raw_sample itself to a numpy array, and did the following:
p1 = pd.Panel(samples, items=map(str, labels))
but when I print this, I get
<class 'pandas.core.panel.Panel'>
Dimensions: 6 (items) x 5 (major_axis) x 4 (minor_axis)
Items axis: 1 to 2
Major_axis axis: 0 to 4
Minor_axis axis: 0 to 3
Looking at the Items, it looks like all the common values are grouped together.
I am not sure what to do at this point. Help!!
Update
Inputs:
labels = [1,1,1,2,2,2]
samples = [5x4 pd.DataFrame, 5x4 pd.DataFrame, 5x4 pd.DataFrame, 5x4 pd.DataFrame, 5x4 pd.DataFrame, 5x4 pd.DataFrame]
Desired Output:
index labels samples
0 1 1 2 3 4 5 6 7
3 5 6 7 9 5 4
3 4 5 6 7 8 9
1 1 4 3 2 4 5 6 7
3 5 6 7 4 5 6
2 3 4 3 4 5 3
...
If you select with non-unique items, you get another Panel:
np.random.seed(10)
labels = [1,1,1,2,2,2]
samples = np.random.randn(6, 5, 4)
p1 = pd.Panel(samples, items=map(str, labels))
print (p1)
<class 'pandas.core.panel.Panel'>
Dimensions: 6 (items) x 5 (major_axis) x 4 (minor_axis)
Items axis: 1 to 2
Major_axis axis: 0 to 4
Minor_axis axis: 0 to 3
print (p1['1'])
<class 'pandas.core.panel.Panel'>
Dimensions: 3 (items) x 5 (major_axis) x 4 (minor_axis)
Items axis: 1 to 1
Major_axis axis: 0 to 4
Minor_axis axis: 0 to 3
print (p1.to_frame())
1 1 1 2 2 2
major minor
0 0 1.331587 1.331587 1.331587 -0.232182 -0.232182 -0.232182
1 0.715279 0.715279 0.715279 -0.501729 -0.501729 -0.501729
2 -1.545400 -1.545400 -1.545400 1.128785 1.128785 1.128785
3 -0.008384 -0.008384 -0.008384 -0.697810 -0.697810 -0.697810
1 0 0.621336 0.621336 0.621336 -0.081122 -0.081122 -0.081122
1 -0.720086 -0.720086 -0.720086 -0.529296 -0.529296 -0.529296
2 0.265512 0.265512 0.265512 1.046183 1.046183 1.046183
3 0.108549 0.108549 0.108549 -1.418556 -1.418556 -1.418556
2 0 0.004291 0.004291 0.004291 -0.362499 -0.362499 -0.362499
1 -0.174600 -0.174600 -0.174600 -0.121906 -0.121906 -0.121906
2 0.433026 0.433026 0.433026 0.319356 0.319356 0.319356
3 1.203037 1.203037 1.203037 0.460903 0.460903 0.460903
3 0 -0.965066 -0.965066 -0.965066 -0.215790 -0.215790 -0.215790
1 1.028274 1.028274 1.028274 0.989072 0.989072 0.989072
2 0.228630 0.228630 0.228630 0.314754 0.314754 0.314754
3 0.445138 0.445138 0.445138 2.467651 2.467651 2.467651
4 0 -1.136602 -1.136602 -1.136602 -1.508321 -1.508321 -1.508321
1 0.135137 0.135137 0.135137 0.620601 0.620601 0.620601
2 1.484537 1.484537 1.484537 -1.045133 -1.045133 -1.045133
3 -1.079805 -1.079805 -1.079805 -0.798009 -0.798009 -0.798009
But if the items are unique, you get a DataFrame:
np.random.seed(10)
labels = list('abcdef')
samples = np.random.randn(6, 5, 4)
p1 = pd.Panel(samples, items=labels)
print (p1)
<class 'pandas.core.panel.Panel'>
Dimensions: 6 (items) x 5 (major_axis) x 4 (minor_axis)
Items axis: a to f
Major_axis axis: 0 to 4
Minor_axis axis: 0 to 3
print (p1['a'])
0 1 2 3
0 1.331587 0.715279 -1.545400 -0.008384
1 0.621336 -0.720086 0.265512 0.108549
2 0.004291 -0.174600 0.433026 1.203037
3 -0.965066 1.028274 0.228630 0.445138
4 -1.136602 0.135137 1.484537 -1.079805
print (p1.to_frame())
a b c d e f
major minor
0 0 1.331587 -1.977728 0.660232 -0.232182 1.985085 0.117476
1 0.715279 -1.743372 -0.350872 -0.501729 1.744814 -1.907457
2 -1.545400 0.266070 -0.939433 1.128785 -1.856185 -0.922909
3 -0.008384 2.384967 -0.489337 -0.697810 -0.222774 0.469751
1 0 0.621336 1.123691 -0.804591 -0.081122 -0.065848 -0.144367
1 -0.720086 1.672622 -0.212698 -0.529296 -2.131712 -0.400138
2 0.265512 0.099149 -0.339140 1.046183 -0.048831 -0.295984
3 0.108549 1.397996 0.312170 -1.418556 0.393341 0.848209
2 0 0.004291 -0.271248 0.565153 -0.362499 0.217265 0.706830
1 -0.174600 0.613204 -0.147420 -0.121906 -1.994394 -0.787269
2 0.433026 -0.267317 -0.025905 0.319356 1.107708 0.292941
3 1.203037 -0.549309 0.289094 0.460903 0.244544 -0.470807
3 0 -0.965066 0.132708 -0.539879 -0.215790 -0.061912 2.404326
1 1.028274 -0.476142 0.708160 0.989072 -0.753893 -0.739357
2 0.228630 1.308473 0.842225 0.314754 0.711959 -0.312829
3 0.445138 0.195013 0.203581 2.467651 0.918269 -0.348882
4 0 -1.136602 0.400210 2.394704 -1.508321 -0.482093 -0.439026
1 0.135137 -0.337632 0.917459 0.620601 0.089588 0.141104
2 1.484537 1.256472 -0.112272 -1.045133 0.826999 0.273049
3 -1.079805 -0.731970 -0.362180 -0.798009 -1.954512 -1.618571
It is the same as a DataFrame with non-unique columns:
samples = np.random.randn(6, 5)
df = pd.DataFrame(samples, columns=list('11122'))
print (df)
1 1 1 2 2
0 0.346338 -0.855797 -0.932463 -2.289259 0.634696
1 0.272794 -0.924357 -1.898270 -0.743083 -1.587480
2 -0.519975 -0.136836 0.530178 -0.730629 2.520821
3 0.137530 -1.232763 0.508548 -0.480384 -1.213064
4 -0.157787 -1.600004 -1.287620 0.384642 -0.568072
5 -0.649427 -0.659585 -0.813359 -1.487412 -0.044206
print (df['1'])
1 1 1
0 0.346338 -0.855797 -0.932463
1 0.272794 -0.924357 -1.898270
2 -0.519975 -0.136836 0.530178
3 0.137530 -1.232763 0.508548
4 -0.157787 -1.600004 -1.287620
5 -0.649427 -0.659585 -0.813359
EDIT:
Also, to create the df from the list you need unique labels (non-unique labels raise an error) and the concat function with the keys parameter; for a Panel, call to_panel:
np.random.seed(100)
raw_sample = []
labels = list('abcdef')
samples = np.random.randn(6, 5, 4)
for contents in range(samples.shape[0]):
    raw_sample.append(pd.DataFrame(samples[contents]))
df = pd.concat(raw_sample, keys=labels)
print (df)
0 1 2 3
a 0 -1.749765 0.342680 1.153036 -0.252436
1 0.981321 0.514219 0.221180 -1.070043
2 -0.189496 0.255001 -0.458027 0.435163
3 -0.583595 0.816847 0.672721 -0.104411
4 -0.531280 1.029733 -0.438136 -1.118318
b 0 1.618982 1.541605 -0.251879 -0.842436
1 0.184519 0.937082 0.731000 1.361556
2 -0.326238 0.055676 0.222400 -1.443217
3 -0.756352 0.816454 0.750445 -0.455947
4 1.189622 -1.690617 -1.356399 -1.232435
c 0 -0.544439 -0.668172 0.007315 -0.612939
1 1.299748 -1.733096 -0.983310 0.357508
2 -1.613579 1.470714 -1.188018 -0.549746
3 -0.940046 -0.827932 0.108863 0.507810
4 -0.862227 1.249470 -0.079611 -0.889731
d 0 -0.881798 0.018639 0.237845 0.013549
1 -1.635529 -1.044210 0.613039 0.736205
2 1.026921 -1.432191 -1.841188 0.366093
3 -0.331777 -0.689218 2.034608 -0.550714
4 0.750453 -1.306992 0.580573 -1.104523
e 0 0.690121 0.686890 -1.566688 0.904974
1 0.778822 0.428233 0.108872 0.028284
2 -0.578826 -1.199451 -1.705952 0.369164
3 1.876573 -0.376903 1.831936 0.003017
4 -0.076023 0.003958 -0.185014 -2.487152
f 0 -1.704651 -1.136261 -2.973315 0.033317
1 -0.248889 -0.450176 0.132428 0.022214
2 0.317368 -0.752414 -1.296392 0.095139
3 -0.423715 -1.185984 -0.365462 -1.271023
4 1.586171 0.693391 -1.958081 -0.134801
p1 = df.to_panel()
print (p1)
<class 'pandas.core.panel.Panel'>
Dimensions: 4 (items) x 6 (major_axis) x 5 (minor_axis)
Items axis: 0 to 3
Major_axis axis: a to f
Minor_axis axis: 0 to 4
EDIT1:
If you need a MultiIndex DataFrame, it is possible to create a helper range for the unique values, use concat, and finally remove the helper level of the MultiIndex:
np.random.seed(100)
raw_sample = []
labels = [1,1,1,2,2,2]
mux = pd.MultiIndex.from_arrays([labels, range(len(labels))])
samples = np.random.randn(6, 5, 4)
for contents in range(samples.shape[0]):
    raw_sample.append(pd.DataFrame(samples[contents]))
df = pd.concat(raw_sample, keys=mux)
df = df.reset_index(level=1, drop=True)
print (df)
0 1 2 3
1 0 -1.749765 0.342680 1.153036 -0.252436
1 0.981321 0.514219 0.221180 -1.070043
2 -0.189496 0.255001 -0.458027 0.435163
3 -0.583595 0.816847 0.672721 -0.104411
4 -0.531280 1.029733 -0.438136 -1.118318
0 1.618982 1.541605 -0.251879 -0.842436
1 0.184519 0.937082 0.731000 1.361556
2 -0.326238 0.055676 0.222400 -1.443217
3 -0.756352 0.816454 0.750445 -0.455947
4 1.189622 -1.690617 -1.356399 -1.232435
0 -0.544439 -0.668172 0.007315 -0.612939
1 1.299748 -1.733096 -0.983310 0.357508
2 -1.613579 1.470714 -1.188018 -0.549746
3 -0.940046 -0.827932 0.108863 0.507810
4 -0.862227 1.249470 -0.079611 -0.889731
2 0 -0.881798 0.018639 0.237845 0.013549
1 -1.635529 -1.044210 0.613039 0.736205
2 1.026921 -1.432191 -1.841188 0.366093
3 -0.331777 -0.689218 2.034608 -0.550714
4 0.750453 -1.306992 0.580573 -1.104523
0 0.690121 0.686890 -1.566688 0.904974
1 0.778822 0.428233 0.108872 0.028284
2 -0.578826 -1.199451 -1.705952 0.369164
3 1.876573 -0.376903 1.831936 0.003017
4 -0.076023 0.003958 -0.185014 -2.487152
0 -1.704651 -1.136261 -2.973315 0.033317
1 -0.248889 -0.450176 0.132428 0.022214
2 0.317368 -0.752414 -1.296392 0.095139
3 -0.423715 -1.185984 -0.365462 -1.271023
4 1.586171 0.693391 -1.958081 -0.134801
But creating a Panel is not possible:
p1 = df.to_panel()
print (p1)
ValueError: Can't convert non-uniquely indexed DataFrame to Panel
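Note that pd.Panel was deprecated and later removed from pandas entirely, so the Panel snippets above only run on old versions; the MultiIndex construction from EDIT1 is the part that still works, condensed here into a runnable sketch:

```python
import numpy as np
import pandas as pd

np.random.seed(100)
labels = [1, 1, 1, 2, 2, 2]
samples = np.random.randn(6, 5, 4)
frames = [pd.DataFrame(s) for s in samples]

# pair each label with its position so the concat keys are unique,
# then drop the helper position level again
mux = pd.MultiIndex.from_arrays([labels, range(len(labels))])
df = pd.concat(frames, keys=mux)
df = df.reset_index(level=1, drop=True)
print(df.shape)
# (30, 4)
```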

Divide part of a dataframe by another while keeping columns that are not being divided

I have two data frames as below:
Sample_name C14-Cer C16-Cer C18-Cer C18:1-Cer C20-Cer
0 1 1 0.161456 0.033139 0.991840 2.111023 0.846197
1 1 10 0.636140 1.024235 36.333741 16.074662 3.142135
2 1 13 0.605840 0.034337 2.085061 2.125908 0.069698
3 1 14 0.038481 0.152382 4.608259 4.960007 0.162162
4 1 5 0.035628 0.087637 1.397457 0.768467 0.052605
5 1 6 0.114375 0.020196 0.220193 7.662065 0.077727
Sample_name C14-Cer C16-Cer C18-Cer C18:1-Cer C20-Cer
0 1 1 0.305224 0.542488 66.428382 73.615079 10.342252
1 1 10 0.814696 1.246165 73.802644 58.064363 11.179206
2 1 13 0.556437 0.517383 50.555948 51.913547 9.412299
3 1 14 0.314058 1.148754 56.165767 61.261950 9.142128
4 1 5 0.499129 0.460813 40.182454 41.770906 8.263437
5 1 6 0.300203 0.784065 47.359506 52.841821 9.833513
I want to divide the numerical values in the selected cells of the first by the second and I am using the following code:
df1_int.loc[:,'C14-Cer':].div(df2.loc[:,'C14-Cer':])
However, this way I lose the information from the column "Sample_name".
C14-Cer C16-Cer C18-Cer C18:1-Cer C20-Cer
0 0.528977 0.061088 0.014931 0.028677 0.081819
1 0.780831 0.821909 0.492309 0.276842 0.281070
2 1.088785 0.066367 0.041243 0.040951 0.007405
3 0.122529 0.132650 0.082047 0.080964 0.017738
4 0.071381 0.190178 0.034778 0.018397 0.006366
5 0.380993 0.025759 0.004649 0.145000 0.007904
How can I perform the division while keeping the column "Sample_name" in the resulting dataframe?
You can selectively overwrite using loc, the same way that you're already performing the division:
df1_int.loc[:,'C14-Cer':] = df1_int.loc[:,'C14-Cer':].div(df2.loc[:,'C14-Cer':])
This preserves the 'Sample_name' column:
In [12]:
df.loc[:,'C14-Cer':] = df.loc[:,'C14-Cer':].div(df1.loc[:,'C14-Cer':])
df
Out[12]:
Sample_name C14-Cer C16-Cer C18-Cer C18:1-Cer C20-Cer
index
0 1 1 0.528975 0.061087 0.014931 0.028677 0.081819
1 1 10 0.780831 0.821910 0.492309 0.276842 0.281070
2 1 13 1.088785 0.066367 0.041243 0.040951 0.007405
3 1 14 0.122528 0.132650 0.082047 0.080964 0.017738
4 1 5 0.071380 0.190179 0.034778 0.018397 0.006366
5 1 6 0.380992 0.025758 0.004649 0.145000 0.007904
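A self-contained check of this pattern, using two rows of hypothetical data shaped like the question's (truncated to two metabolite columns):

```python
import pandas as pd

df1 = pd.DataFrame({"Sample_name": ["1 1", "1 10"],
                    "C14-Cer": [0.161456, 0.636140],
                    "C16-Cer": [0.033139, 1.024235]})
df2 = pd.DataFrame({"Sample_name": ["1 1", "1 10"],
                    "C14-Cer": [0.305224, 0.814696],
                    "C16-Cer": [0.542488, 1.246165]})

# overwrite only the numeric slice; "Sample_name" is left untouched
df1.loc[:, "C14-Cer":] = df1.loc[:, "C14-Cer":].div(df2.loc[:, "C14-Cer":])
print(df1["Sample_name"].tolist())
# ['1 1', '1 10']
```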

Best way to split a DataFrame given an edge

Suppose I have the following DataFrame:
a b
0 A 1.516733
1 A 0.035646
2 A -0.942834
3 B -0.157334
4 A 2.226809
5 A 0.768516
6 B -0.015162
7 A 0.710356
8 A 0.151429
And I need to group it given the "edge B"; that means the groups will be:
a b
0 A 1.516733
1 A 0.035646
2 A -0.942834
3 B -0.157334

a b
4 A 2.226809
5 A 0.768516
6 B -0.015162

a b
7 A 0.710356
8 A 0.151429
That is, any time I find a 'B' in column 'a', I want to split the DataFrame.
My current solution is:
# create the dataframe
s = pd.Series(['A','A','A','B','A','A','B','A','A'])
ss = pd.Series(np.random.randn(9))
dff = pd.DataFrame({"a": s, "b": ss})
# my solution
count = 0
ls = []
for i in s:
    if i == "A":
        ls.append(count)
    else:
        ls.append(count)
        count += 1
dff['grpb'] = ls
and I got the dataframe:
a b grpb
0 A 1.516733 0
1 A 0.035646 0
2 A -0.942834 0
3 B -0.157334 0
4 A 2.226809 1
5 A 0.768516 1
6 B -0.015162 1
7 A 0.710356 2
8 A 0.151429 2
Which I can then split with dff.groupby('grpb').
Is there a more efficient way to do this using pandas' functions?
Here's a one-liner (note: it relies on Python 2's subscriptable zip and the long-removed pd.rolling_median, so it won't run on current versions):
zip(*dff.groupby(pd.rolling_median((1*(dff['a']=='B')).cumsum(),3,True)))[-1]
[ 1 2
0 A 1.516733
1 A 0.035646
2 A -0.942834
3 B -0.157334,
1 2
4 A 2.226809
5 A 0.768516
6 B -0.015162,
1 2
7 A 0.710356
8 A 0.151429]
How about:
df.groupby((df.a == "B").shift(1).fillna(0).cumsum())
For example:
>>> df
a b
0 A -1.957118
1 A -0.906079
2 A -0.496355
3 B 0.552072
4 A -1.903361
5 A 1.436268
6 B 0.391087
7 A -0.907679
8 A 1.672897
>>> gg = list(df.groupby((df.a == "B").shift(1).fillna(0).cumsum()))
>>> pprint.pprint(gg)
[(0,
a b
0 A -1.957118
1 A -0.906079
2 A -0.496355
3 B 0.552072),
(1, a b
4 A -1.903361
5 A 1.436268
6 B 0.391087),
(2, a b
7 A -0.907679
8 A 1.672897)]
(I didn't bother getting rid of the indices; you could use [g for k, g in df.groupby(...)] if you liked.)
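The same shift-and-cumsum idea can be written with shift(fill_value=False) (available in newer pandas), which keeps the dtype boolean and makes the grouping easy to verify:

```python
import pandas as pd

dff = pd.DataFrame({"a": list("AAABAABAA"), "b": range(9)})
# increment the group key on the row *after* each 'B'
key = (dff.a == "B").shift(fill_value=False).cumsum()
groups = [g for _, g in dff.groupby(key)]
print([g.index.tolist() for g in groups])
# [[0, 1, 2, 3], [4, 5, 6], [7, 8]]
```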
An alternative is:
In [36]: dff
Out[36]:
a b
0 A 0.689785
1 A -0.374623
2 A 0.517337
3 B 1.549259
4 A 0.576892
5 A -0.833309
6 B -0.209827
7 A -0.150917
8 A -1.296696
In [37]: dff['grpb'] = np.NaN
In [38]: breaks = dff[dff.a == 'B'].index
In [39]: dff['grpb'][breaks] = range(len(breaks))
In [40]: dff.fillna(method='bfill').fillna(len(breaks))
Out[40]:
a b grpb
0 A 0.689785 0
1 A -0.374623 0
2 A 0.517337 0
3 B 1.549259 0
4 A 0.576892 1
5 A -0.833309 1
6 B -0.209827 1
7 A -0.150917 2
8 A -1.296696 2
Or using itertools to create 'grpb' is an option too.
def vGroup(dataFrame, edgeCondition, groupName='autoGroup'):
    groupNum = 0
    dataFrame[groupName] = ''
    # loop over each row
    for inx, row in dataFrame.iterrows():
        if edgeCondition[inx]:
            dataFrame.ix[inx, groupName] = 'edge'  # .ix is removed in modern pandas; use .loc
            groupNum += 1
        else:
            dataFrame.ix[inx, groupName] = groupNum
    return dataFrame[groupName]

vGroup(df, df[0] == ' ')
